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Chapter 11 
WHY MODEL? 



Do not worry. No more chapter introductions around the effect of your looking up on other people's looking up. 
We squeezed that example well beyond what seemed possible. In Part II of this book, most examples involve real 
data. The data sets can be downloaded from the book's web site. 

Part I was mostly conceptual. Calculations were kept to a minimum, and could be carried out by hand. In 
contrast, the material described in Part II requires the use of computers to fit regression models, such as linear 
and logistic models. Because this book cannot provide a detailed introduction to regression techniques, we assume 
that readers have a basic understanding and working knowledge of these commonly used models. Our web site 
provides SAS programs to replicate all analyses described in the text (check the code margin notes). Our web 
site also provides links to other sites from which some STATA and R programs can be obtained. 

This chapter describes the differences between the nonparametric estimators used in Part I and the parametric 
(model-based) estimators used in Part II. It also reviews the concept of smoothing and, briefly, the bias-variance 
trade-off involved in any modeling decision. The chapter motivates the need for models in data analysis, regardless 
of whether the analytic goal is causal inference or, say, prediction. We will take a break from causal considerations 
until the next chapter. 



11.1 Data cannot speak for themselves 

Consider a study population of 16 individuals infected with the human im- 
munodeficiency virus (HIV). Unlike in Part I of this book we will not view 
these individuals as representatives of 1 billion individuals identical to them. 
Rather, these are just 16 individuals randomly sampled from a large, possibly 
hypothetical super-population: the target population. 

At the start of the study each individual receives a certain level of a treat- 
ment A (antiretroviral therapy), which is maintained during the study. At the 
end of the study, a continuous outcome Y (CD4 cell count, in cells/mm"^) is 
measured in all individuals. We wish to consistently estimate the mean of Y 
among subjects with treatment level A = a, in the population from which the 
16 subjects were randomly sampled. That is, the estimand is the unknown 
population parameter E[F|A = a]. 

An cistimatoT E[Y\A = a] oiFi[Y\A = a] is some function of the data that is 
used to estimate the unknown population parameter. Informally, a consistent 
See Chapter 10 for a rigorous defi- estimator E[F|j4 = a] meets the requirement that "the larger the sample size, 
nition of a consistent estimator. the closer the estimate to the population value E[y|y4 = a]." Two examples of 

possible estimators E[y|A = a] are (i) the sample average of Y among those 
receiving A = a, and (ii) the value of the first observation in the dataset that 
happens to have the value A = a. The sample average of Y among those 
receiving A = a is a. consistent estimator of the population mean; the value of 
the first observation with vl = a is not. In practice we require all estimators 
to be consistent, and therefore we use the sample average to estimate the 
population mean. 

Suppose treatment vl is a dichotomous variable with two possible values: no 
treatment {A = 0) and treatment {A = 1). Half of the individuals were treated 



Causal Inference 



{A = 1). Figure 11.1 is a scatter plot that displays each of the 16 individuals 
as a dot. The height of the dot indicates the value of the individual's outcome 
Y. The 8 treated individuals are placed along the column A — 1, and the 
8 untreated along the column ^4 = 0. An estimate of the mean of Y among 
subjects with level A — a m the population is the numerical result of applying 
the estimator (i.e., the sample average) to a particular data set. 

Our estimate of the population mean in the treated is the sample average 
146.25 for those with A = 1, and our estimate of the population mean in the 
imtrcatcd is the sample average 67.50 in those with A = 0. Under exchange- 
ability of the treated and the untreated, the difference 146.25 — 67.50 would 
be interpreted as an estimate of the average causal effect of treatment A on 
the outcome Y in the target population. However, this chapter is not about 
making causal inferences. Our goal is simply to motivate the need for models 
when trying to estimate population quantities like E[y|A = a], irrespective of 
whether they do or do not have a causal interpretation. 

Now suppose treatment ^4 is a poljdiomous variable that can take 4 possible 
values: no treatment {A = 1), low-dose treatment = 2). medium-dose treat- 
ment (A = 3), and high-dose treatment = 4). A quarter of the individuals 
received each treatment level. Figure 11.2 displays the outcome value for the 
16 individuals in thc^ study population. To estimate the population means in 
the 4 groups defined by treatment level, we compute the corresponding sample 
averages. The estimates are 70.0, 80.0, 117.5, and 195.0 for A = 1, A = 2, 
A = 3, and A = 4, respectively. 

Figures 11.1 and 11.2 depict examples of discrete (categorical) treatment 
variables with 2 and 4 categories, respectively. Because the number of study 
subjects is fixed at 16, the number of subjects per category decreases as the 
number of categories increase. The sample average in each category is still 
a consistent estimator of the corresponding population mean, but the proba- 
bility that the sample average is close to the corresponding population mean 
decreases as the number of subjects in each category decreases. The length of 
the 95% confidence intervals (see Chapter 10) for the category-specific means 
will be greater for Figure 11.2 than for Figure 11.1. 

Finally suppose that treatment A is variable representing the dose of treat- 
ment in mg/day, and that it takes integer values from 0 to 100 mg. Figure 
11.3 displays the outcome value for each of the 16 individuals. Because the 
number of possible values of treatment is much greater than the number of in- 
dividuals in the study, there are many values of A that no individual received. 
For example, there are no individuals with treatment dose A = 90 in the study 
population. 

This creates a problem: how can we estimate the mean of Y among subjects 
with treatment level A = 90 in the target population? The estimator we used 
for the data in Figures 11.1 and 11.2 — the treatment-specific sample average — 
is undefined for treatment levels in Figure 11.3 for which there are no subjects. 
If treatment A were a truly continuous variable, then the sample average would 
be undefined for nearly all treatment levels. (A continuous variable A can be 
viewed as a categorical variable with an infinite number of categories.) 

The above description shows that in some settings we cannot let the data 
"speak for themselves" to obtain a consistent estimate. Rather, we need to 
supplement the data with a model as described in the next section. 
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11.2 Parametric estimators 



We want to consistently estimate the mean of Y among subjects with treatment 
level A = 90, i.e., E[y|^ = 90], from the data in Figure 11.3. Suppose we 
expect the mean of Y among subjects with treatment level A = 90 to lie 
between the mean among subjects with A = 80 and the mean among subjects 
with A = 100. In fact, suppose we knew that the treatment-specific population 
mean of Y is proportional to the value of treatment A throughout the range 
of A. More precisely, we know that the mean of Y, E[Y\A], increases (or 
decreases) from some value tor A = 0 by 6i units per unit of A. Or, more 
compactly, 



E[Y\A] = + OiA 



More generally, the restriction on 
the shape of the relation is known 
as the functional form. 
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Figure 11.4 

CODE: Program 11.2 
Under the assumption that the vari- 
ance of the residuals does not de- 
pend on A (homoscedasticity), the 
Wald 95% confidence intervals are 
(-21.2,70.3) for Oq, (1.28,2.99) 
for 6*1, and (172.1,261.6) for 
E[F|A = 90]. 



This equation is a restriction on the shape of the relation between treatment 
level and the mean of the outcome, i.e., the dose- response curve. This particu- 
lar restriction is referred to as a linear mean m,odel, and the quantities and 
0\ are referred to as the parameters of the model. Models that describe the 
dose-response curve in terms of a finite number of parameters are referred to as 
parametric models. In our example, the parameters 0^ and 6i define a straight 
line that crosses (intercepts) the vertical axis at and that has a slope Ox. 
That is, the model specifies that all possible dose-response curves are straight 
lines, though their intercepts and slopes may vary. 

We are now ready to combine the data in Figure 11.3 with our parametric 
model to estimate E[y|A = a] for all values a from 0 to 100. The first step is to 
obtain consistent estimates 6^ and 9i of the parameters 6q and 6i . The second 
step is to use these estimates to estimate the mean of Y for any value A = a. 
For example, to estimate the mean of Y among subjects with treatment level 
A = 90, we use the expression E[Y\A = 90] = + 90^i. The estimate F.[Y\A] 
for each individual is referred to as the predicted value. 

The parameters and 9i can be consistently estimated by the method of 
ordinary least squares estimation. A nontechnical motivation of the method 
follows. Consider all possible candidate straight lines for Figure 11.3, each of 
them with a different combination of values of intercept and slope 9\. For 
each candidate line, one can calculate the vertical distance from each dot to 
the line (the residual), square each of those 16 residuals, and then sum the 
16 squared residuals. The line for which the sum is the smallest is the "least 
squares" line, and the parameter values 6[) and 9i of this "least squares" line 
are the "least squares" estimates. The values and 9i can be easily computed 
using linear algebra, as described in any statistics textbook. 

In our example, the parameter estimates are — 24.55 and 9i — 2.14, 
which define the straight line shown in Figure 11.4. The predicted mean of 

Y among subjects with treatment level A = 90 is therefore E[y|A = 90] = 
24.55 + 90 X 2.14 = 216.9. Because ordinary least squares estimation uses all 
data points to find the best line, the mean of Y in the group A = a, i.e., 
E[F|vl = a], is estimated by borrowing information from subjects who have 
values of treatment A not equal to a. 

So what is a model? A model is an a priori restriction on the distribution 
of the data. Our linear model says that the dose-response curve is a straight 
line, which restricts its shape. For example, the model says that the mean of 

Y iov A = 90 restricts its value to be between the mean of Y lov A = 80 and 
the mean of Y iov A = 100. This restriction is encoded by parameters like 9q 
and 9i . A parametric model is like adding information that is not in the data 
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to compensate for the lack of sufficient information in the data themselves. 

Parametric estimators — those based on parametric models — allow us to 
consistently estimate quantities that cannot be consistently estimated other- 
wise, e.g., the mean of Y among subjects with treatment level A = 90. But 
this is not a free lunch. When using a parametric model, the inferences are cor- 
rect only if the restrictions encoded in the model are correct, i.e. if the model 
is correctly specified. Thus model-based causal inference — ^to which the re- 
mainder of this book is devoted — relies on the condition of (approximately) no 
model misspecification. Because parametric models are rarely, if ever, perfectly 
specified, certain degree of model misspecification is almost always expected. 



11.3 Nonparametric estimators 



code: Program 11.2 



In this book we define "model" as 
an a priori mathematical restric- 
tion on the possible states of nature 
(Robins, Greenland 1986). Part I 
was entitled "Causal inference with- 
out models" because it only de- 
scribed saturated models. 



A saturated model has the same 
number of unknowns in both sides 
of the equal sign. 



Let us return to the data in Figure 11.1. Treatment A is dichotomous and we 
want to consistently estimate the mean of Y in the treated E[y|A = 1] and in 
the untreated £[1^1 A = 0]. Suppose we have become so enamored with models 
that we decide to use one to estimate these two quantities. Again we proposed 
a linear model 

E[Y\A] = 00 + eiA 

where E[F|A = 0] = 60 + 0x61 = 6*0 andE[y|A = 1] = 6o + lx6i = 60+61. We 
use the least squares method to obtain consistent estimates of the parameters 
and 61. These estimates are 60 = 67.5 and 61 = 78.75. We therefore estimate 
E[Y\A = 0] = 67.5 and E[Y\A = 1] = 146.25. Note that our model-based 
estimates of the mean of Y are identical to the sample averages we calculated 
in Section 11.1. This is not a coincidence but an expected finding. 

Let us take a second look at the model E[F|j4 = a] — 60 + 61A with a 
dichotomous treatment A. If we rewrite the model as E[y |A = 1] = E[y|A = 
0] + 6\, we see that the model simply states that the mean in the treated 
E[y[^ = 1] is equal to the mean in the untreated E[y|j4 = 0] plus a quantity 
61 , where 61 may be negative, positive or zero. But this statement is of course 
always true! The model imposes no restrictions whatsoever on the values of 
E[Y\A = 1] and E[Y\A = 0]. Therefore E[y|^ = a] = 6^ + 61A with a 
dichotomous treatment A is not a model because it lets the data speak for 
themselves, just like the sample average does. "Models" which do not impose 
restrictions are saturateA models. Because they formally look like models even if 
they do not fit our definition of model, saturated models are ordinarily referred 
to as models too. 

Whenever the number of parameters in the model is equal to the number 
of population quantities that can be estimated by using the model, then the 
model is saturated. For example, the linear model E[y|y4] =60 + 6\A has 
two parameters and, when A is dichotomous, estimates two quantities: the 
means E[y|A = 1] and E[y|A = 0]. Since the values of the two parameters 
are not restricted by the model, neither are the values of the means. As a 
contrast, consider the data in Figure 11.3 where A can take values from 0 to 
100. The linear model E[y|A] — 6o+6iA has two parameters but estimates 101 
quantities, i.e., E[Y\A = 0],E[y|A = l],...,E[y|A = 100]. The only hope for 
consistently estimating 101 quantities with two parameters is to be fortunate 
to have all 101 means E[y|A = a] lie along a straight line. When a model has 
only a few parameters but it is used to estimate many population quantities, 
we say that the model is parsimonious. 
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Nonparametric estimators are those that produce estimates from the data 

withoTit any a priori restrictions on the functional form of the estimate. An 
example of nonparametric estimator of the population mean E[y|A = a] for 
a dichotomous treatment is its empirical version, the sample average. Or, 
equivalently, the saturated model described in this Section. All methods for 
causal inference that we described in Part I of this book — standardization, IP 
weighting, stratification, matching — were based on nonparametric estimators 
because they did not impose any a priori restrictions on the value of the effect 
estimates. In contrast, all methods for causal inference described in Part II of 
this book rely on estimators that are (at least partly) parametric. Parametric 
estimation and other approaches to borrow information are our only hope 
when, as is often the case, data are unable to speak for themselves. 



11.4 Smoothing 



Caution: Often the term "linear" is 
used with two different meanings. 
A model is linear when is expressed 
as a linear combination of parame- 
ters and variables. A term in the 
model is linear when it defines the 
slope of a straight line in the ab- 
sence of a quadratic term. 



code: Program 11.3 
Under the homoscedasticity as- 
sumption, the Wald 95% confi- 
dence interval for E[Y\A = 90] is 
(122.2,602.3). 



Consider again the data in Figure 11.3 and the linear model E[y|A] = 9o+9iA. 
The parameter 9\ is the difference in mean outcome per unit of treatment dose 
A. Because 6i is a single number, the model specifies that the difference in 
mean outcome Y per unit of treatment A must be constant throughout the 
entire range of A, that is, the model requires the conditional mean outcome to 
follow a straight line as a function of treatment dose A. Figure 11.4 shows the 
best-fitting straight line. 

But one can imagine situations in which the difference in mean outcome is 
larger for a one-unit change at low doses of treatment, and smaller for a one- 
unit change at high doses. This would be the case if, once the treatment dose 
reaches certain level, higher doses have an increasingly small effect. Under 
this scenario, the model E[y|A] = 9a + 9iA is incorrect. However, linear 
models can be made more flexible. For example, suppose we fit the model 
E[y|A] = ^0 + OiA + ^2^^, where A^ = A x A is A-squared, to the data in 
Figure 11.3. This is still referred to as a linear model because the conditional 
mean is expressed as a linear combination, i.e., as the sum of the products of 
each covariate {A and A"^) with its associated coefficient (the parameters 6i and 
92) plus an intercept (f?o)- However, whenever 92 is not zero, the parameters 
^0, 9i, and 92 now define a curve — a parabola — rather than a straight line. We 
refer to 6^ as the parameter for the linear term A, and to 62 as the parameter 
for the quadratic term A^. 

The curve under the 3-parameter linear model E[y |A] = 9o + 9iA + 92A'^ 
can be found via ordinary least squares estimation applied to the data in 
Figure 11.3. The estimated curve is shown in Figure 11.5. The parameter 
estimates are 9o = —7.41, 9i = 4.11, and ^2 = —0.02. The predicted mean of 
Y among subjects with treatment level A = 90 is obtained from the expression 
E[Y\A = 90] = ^0 + 90^1 + 90 X 90^2 = 197.1. 

We could keep adding parameters for a cubic term {9s A^'^), a quartic term 
(^4^^)... until we reach a 15th-degree term {6ir,A^^). At that point the number 
of parameters in our model equals the number of data points (individuals). The 
shape of the curve would change as the number of parameters increases. In 
general, the more parameters in the model, the more inflection points will 
appear. That is, the curve generally becomes more "wiggly," or less smooth, 
as the number of parameters increase. A model with 2 parameters — a straight 
line — is the smoothest model (a line with no inflection points). A model with 
as many parameters as data points is the least smooth model (as many possible 
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inflection points as data points). 

Modeling can be viewed as a procedure to transform noisy data into more 
or less smooth curves. This smoothing occurs because the model borrows 
information from many data points to predict the outcome value at a particular 
combination of values of the covariates. The smoothing results from E[y|j4 = 
a] being estimated by borrowing information from subjects with A not equal 
to a. All parametric estimators incorporate some degree of smoothing. 

The degree of smoothing depends on how much information is borrowed 
across individuals. The 2-parameter model E[y|A] = 9q + 6\A estimates 
E[F|A = 90] by borrowing information from all individuals in the study pop- 
ulation to find the least squares straight line. A model with as many parame- 
ters as individuals does not borrow any information to estimate E[y|A] at the 
values of A that occur in the data, though it borrows information (by inter- 
polation) for values of A that do not occur in the data. Intermediate degrees 
of smoothing can be achieved by using an intermediate number of parameters 
or, more generally, by restricting the number of individuals that contribute to 
the estimation. For example, to estimate E[F|A = 90] we could decide to fit a 
2-parameter model E[F| A] = + restricted to individuals with treatment 
doses between 80 and 100. That is, we would only borrow information from 
individuals in a 10-unit window of A = 90. The wider the window around 
A = 90, the more smoothing would be achieved. 

In our simplistic examples above, all models included either one covariate 
A or two covariates A and A^ so that the curves can be represented on a two- 
dimensional book page. In realistic applications, models often include many 
different covariates so that the curves are really hyperdimensional surfaces. 
Regardless of the dimensionality of the problem, the concept of smoothing 
remains invariant: the fewer parameters in the model, the smoother the pre- 
diction (response) surface will be. 



11.5 The bias-variance trade-off 

In previous sections we have used the 16 individuals in Figure 11.3 to estimate 
the mean OTitcomc Y among people receiving a treatment dose of A = 90 in the 
target population, E[y|A = 90]. Since nobody in the study population received 
A = 90, we could not let the data speak for themselves. So we combined the 
data with a linear model. The estimate E[y|A = 90] varied with the model. 
Under the 2-parameter model E[y|A] = 6q + 0iA, the estimate was 216.9 (95% 
CI: 172.1, 261.6). Under the 3-parameter model E[F| A] = eo+exA + e2A^, the 
estimate was 197.1 (95% CI: 142.8, 251.5). We used two different parametric 
models that yielded two different estimates. Which one is better? Is 216.9 or 
197.1 closer to the mean in the target population? 

If the relation is truly curvilinear, then the estimate from the 2-parameter 
model will be biased because this model assumes a straight line. On the other 
hand, if the relation is truly a straight line, then the estimates from both models 
will be valid. This is so because the 3-parameter model E[y|A] = + + 
O^A^ is correctly specified whether the relation follows a straight line (in which 
case O2 = 0) or a parabolic curve (in which case 62 7^ 0). One safe strategy 
would be to use the 3-parameter model E[y | A] = 6q+6iA + 82 A"^ rather than 
the 2-parameter model E[y |A] = 0q + Q^A. Because the 3-parameter model is 
correctly specified under both a straight line and a parabolic curve, it is less 
likely to be biased. In general, the larger the number of parameters in the 
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Figure 11.5 

We used a model for continuous 
outcomes as an example. The same 
reasoning applies to models for di- 
chotomous outcomes such as logis- 
tic models (see Fine Point 11.1) 
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model, the fewer restrictions the model imposes; the less smooth the model, 

the more protection afforded against bias from model misspecification. 

Although less smooth models may yield a less biased estimate, they also 
result in a larger variance, i.e., wider 95% confidence intervals around the 
estimate. For example, the estimated 95% confidence interval around EfFjA = 
90] was much wider when we used the 3-parameter model than when we used 
the 2-parameter model. However, when the estimate E[y|j4 = 90] based on the 
2-parameter model is biased, the standard (nominal) 95% confidence interval 
will not cover the true parameter E[y |A = 90] 95% of the time. 

This bias-variance trade-off is at the heart of all data analyses. Investiga- 
tors using models need to decide whether some protection against bias — by, 
say, adding more parameters to the model — is worth the cost in terms of vari- 
ance. Though some formal procedures exist to aid these decisions, in practice 
many investigators decide which model to use based on criteria like tradition, 
interpretability of the parameters, and software availability. In this book we 
will usually assume that our parametric models are correctly specified. This 
is an unrealistic assumption, but it allows us to focus on the problems that 
are specific to causal analyses. Model misspecification is, after all, a problem 
that can arise in any sort of data analysis, regardless of whether the estimates 
are endowed with a causal interpretation. In practice, careful investigators will 
always question the validity of their models, and will conduct analysis to assess 
the sensitivity of their estimates to model specification. 

We are now ready to describe the use of models for causal inference. 
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Technical Point 11.1 

A taxonomy of commonly used models. The main text describes linear regression models of the form E[y|X] = 

p 

OX = OiXi where X is a vector of covariates Xo,Xi,...Xp with Xq = 1 for all n individuals. Linear regression 

i=0 

models are a subset of larger class of models: Generalized Linear Models or GLMs (McCullagh and Nelder 1989). GLMs 

p p 
have three components: a linear functional form ^ 9iXi, a link function g{-} such that g({E[y|X]} = ^ 9iXi, and 

i=0 i=l 

a distribution for the Y conditional on X. If we do not model the distribution of Y conditional on X, we refer to the 
model as a conditional mean model. Conditional mean models only specify a parametric form for E[y|X] but do not 
otherwise restrict the distribution of Y\X. These are the models we most commonly use in Part 11. 

The linear regression models described in the main text is a conditional mean model that uses the identity link 

function. Conditional mean model for outcomes with strictly positive values (e.g., counts, the numerator of incidence 
rates) often use the log link function to ensure that all predicted values will be greater than zero, i.e., log{E[y|X]} — 
p f ^ \ 

^ diXi so E[F|X] = exp I ^ 9iXi J . Conditional mean models for dichotomous outcomes (i.e., those that only take 

i=0 \i=0 / 

values 0 and 1) often use a logit link i.e., log | ^^^^Iq } = E ^'i^i, so that E[Y\X] = expit ^f] 9iX^^. This link 

ensures that all predicted values will be greater than 0 and less than 1. Conditional mean models that use the logit 
function are referred to as logistic regression models, and they are widely used in this book. For these links (referred 
to as canonical links) we can estimate 9 by maximum likelihood under a normal model for the identity link, a Poisson 
model for the log link, and a logistic regression model for the logit link. These estimates are consistent for 0 as long as 
the conditional mean model for E[y|X] is correct. Generalized estimating equation (GEE) models, often used to deal 
with repeated measures, are a further example of a conditional mean models (Liang and Zeger, 1986). 

Conditional mean models themselves can be generalized by relaxing the assumption that E[y|X] takes a parametric 
form. For example, a kernel regression model does not impose a specific functional form on E[F|X] but rather estimates 

n n 

E[y|X ~ x] for any x by E ''^ft. ^ ^i) ^/ E ^'i ^ -^i) where Wh (z) is a positive function, known as a kernel 

1=1 i=l 

function, that attains its maximum value at 2; = 0 and decreases to 0 as \z\ gets large at a rate that depends on the 

parameter h subscripting w. As another example, generalized additive models (GAMs) replace the linear combination 

p p 

E 9iXi of a conditional mean model by a sum of smooth functions E fii^i)- The model can be estimated using a 

i=0 i=0 

backfitting algorithm with /j(-) estimated at iteration k by, for example, kernel regression. (Hastie and Tibshirani 1990). 

In the text we discuss smoothing with parametric models, which specify an a priori functional form for E[F|X = x], 
such as a parabola. In estimating E[y|X = x], they may borrow information from values of X that are far from 
X. In contrast, kernel regression models do not specify an a priori functional form and borrow information only from 
values of X near to x when estimating E[y|X = x]. A kernel regression model is an example of a "non-parametric" 
regression model. This use of the term "nonparametric" differs from our previous usage. Our nonparametric estimators 
of E [Y\X — x] only used those subjects for whom X equalled x exactly; no information was borrowed even from close 
neighbors. Here "nonparametric" estimators of E[y|X = x] use subjects with values of X near to x. How near is 
controlled by a smoothing parameter referred to as the bandwidth h. Our nonparametric estimators correspond to 
taking h = 0. 



Chapter 12 

IP WEIGHTING AND MARGINAL STRUCTURAL MODELS 



Part II is organized around the causal question "what is the average causal effect of smoking cessation on body 
weight gain?" In this chapter we describe how to use IP weighting to estimate this effect using observational data. 
Though IP weighting was introduced in Chapter 2, we only described it as a nonparametric method. We now 
describe the use of models together with IP weighting which, under additional assumptions, will allow us to tackle 
high-dimensional problems with many covariates and nondichotomous treatments. 

To estimate the effect of smoking cessation on weight gain we will use real data from the NHEFS, an acronym 
that stands for (ready for a long name?) National Health and Nutrition Examination Survey Data I Epidemi- 
ologic Follow-up Study. The NHEFS was jointly initiated by the National Center for Health Statistics and the 
National Institute on Aging in collaboration with other agencies of the United States Public Health Service. A 
detailed description of the NHEFS, together with publicly available data sets and documentation, can be found at 
www.cdc.gov/nchs/nhanes/nhefs/nhefs.htm. For this and future chapters, we will use a subset of the NHEFS 
data that is available from this book's web site. We encourage readers to improve upon and refine our analyses. 



12.1 The causal question 

Our goal is to estimate the average causal effect of smoking cessation (the 
treatment) A on weight gain (the outcome) Y. To do so, we will use data 
from 1566 cigarette smokers aged 25-74 years who, as part of the NHEFS, had 
a baseline visit and a follow-up visit about 10 years later. Individuals were 
classified as treated A = 1 if they reported having quit smoking before the 
follow-up visit, and as untreated A = 0 otherwise. Each individual's weight 
gain Y was measured (in kg) as the body weight at the follow-up visit minus 
the body weight at the baseline visit. Most people gained weight, but quitters 
gained more weight on average. The average weight gain was E[F|A = 1] = 4.5 
kg in the quitters, and E[F|A = 0] = 2.0 kg in the non-quitters. The difference 
E[Y\A = 1] - E[Y\A = 0] was therefore estimated to be 2.5, with a 95% CI 
from 1.7 to 3.4. A conventional statistical test of the null hypothesis that this 
difference was equal to zero yielded a P-value< 0.001. 

We define E[y""^] as the mean weight gain that would have been observed 
if all individuals in the population had quit smoking before the follow-up visit, 
and E[y°"°] as the mean weight gain that would have been observed if all 
individuals in the population had not quit smoking. We define the average 
causal effect on the difference scale as E[F°=^] — E[y^*'], that is, the difference 
in mean weight that would have been observed if everybody had been treated 
compared with untreated. This is the causal effect that we will be primarily 
concerned with in this and the next chapters. 

The associational difference E[F|A = 1] — E[y|A = 0], which we estimated 
in the first paragraph of this section, is generally different from the causal 
difference E[y^^] — E[Y The former will not generally have a causal 
interpretation if quitters and non-quitters differ with respect to characteristics 
that affect weight gain. For example, quitters were on average 3 years older 
than non-quitters (quitters were 44% more likely to be above age 50 than non 



We restricted the analysis to 
NHEFS individuals with known sex, 
age, race, weight, height, educa- 
tion, alcohol use and intensity of 
smoking at the baseline (1971-75) 
and follow-up (1982) visits, and 
who answered the general medical 
history questionnaire at baseline. 



Table 12.1 



Mean baseline 




A 


characteristics 


1 


0 


Age, years 


46.2 


42.8 


Men, % 


54.6 


46.6 


White, % 


91.1 


85.4 


University educa- 


15.4 


9.9 


tion, % 






Weight, kg 


72.4 


70.3 


Cigarettes/day 


18.6 


21.2 


Years smoking 


26.0 


24.1 


Little or no exer- 


40.7 


37.9 


cise, % 






Inactive daily life. 


11.2 


8.9 



% 
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Fine Point 12.1 

Setting a bad example. Our smoking cessation example is convenient: it does not require deep subject-matter 
knowledge and the data are publicly available. One price we have to pay for this convenience is potential selection bias. 

We classified individuals as treated A = 1 if they reported (i) being smokers at baseline in 1971-75, and (ii) having 
quit smoking in the 1982 survey. Condition (ii) implies that the individuals included in our study did not die and were 
not otherwise lost to follow-up between baseline and 1982 (otherwise they would not have been able to respond to the 
survey). That is, we selected individuals into our study conditional on an event — responding the 1982 survey — that 
occurred after the start of the treatment — smoking cessation. If treatment affects the probability of selection into the 
study, we might have selection bias as described in Chapter 8. 

A randomized experiment of smoking cessation would not have this problem. Each individual would be assigned to 
either smoking cessation or no smoking cessation at baseline, so that their treatment group would be known even if the 
individual did not make it to the 1982 visit. In Section 12.6 we describe how to deal with potential selection bias due 
to censoring or missing data for the outcome — something that may occur in both observational studies and randomized 
experiments — but the situation described in this Fine Point is different: here the missing data concerns the treatment 
itself. This form of selection bias can be handled through sensitivity analysis, as was done in Appendix 3 of Hernan et 
al (2008). 

The choice of this example allows us to describe, in our own analysis, a ubiquitous problem in published analyses 
of observational data: treatments that start before the follow-up. Though we decided to ignore this issue in order to 
keep our analysis simple, didactic convenience would not be a good excuse to avoid dealing with this bias in real life. 



quitters), and older people gained less weight than younger people, regardless 
of whether they did or did not quit smoking. We say that age is a (surrogate) 
confounder of the effect of A on F and our analysis needs to adjust for age. The 
unadjusted estimate 2.5 might underestimate the true causal effect £[1^"*=^] — 

E[y"=0]. 

As shown in Table 12.1, quitters and non-quitters also differed in their dis- 
tribution of other variables such as sex, race, education, baseline weight, and 
intensity of smoking. If these variables are confounders, then they also need 
to be adjusted for in the analysis. In Chapter REF we discuss criteria for con- 
founder selection. Here we assume that the following 9 variables, all measured 
at baseline, are sufficient to adjust for confounding: sex (0: male, 1: female), 
age (in years), race (0: white, 1: other), education (5 categories), intensity 
and duration of smoking (number of cigarettes per day and years of smoking), 
physical activity in daily life (3 categories), recreational exercise (3 categories), 
and weight (in kg). That is, L represents a vector of 9 measured covariates. In 
the next section we use IP weighting to adjust for these covariates. 



12.2 Estimating IP weights via modeling 

IP weighting creates a pseudo-population in which the arrow from the con- 
founders L to the treatment A is removed. Thiis, if the confounders L are 
sufficient to block all backdoor paths from A to Y, then all confounding is 
eliminated in the pseudo-population. That is, the association between A and 
Y in the pseudo-population consistently estimates the causal effect of A on Y. 
Please reread Chapters 2 and 7 if you need a refresher on IP weighting. 

Informally, the pseudo-population is created by weighting each individ- 
ual by the inverse of the conditional probability of receiving the treatment 



CODE: Program 12.1 computes the 
descriptive statistics shown in this 
section 
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The conditional probability of treat- 
ment Pr[j4=l|L] is known as 
the propensity score. More about 
propensity scores in Chapter 15. 



The curse of dimensionality was in- 
troduced in Chapter 10. 



code: Program 12.2 
The estimated IP weights 
ranged from 1.05 to 16.7, and their 
mean was 2.00. 



The weighted least squares esti- 
mates 00 and 9i with weight W 
of Of) and 9i are the minimizers 

of j:,waY^-{eo + e,A,)f. If 

Wi = 1 for all subjects, we obtain 

the ordinary least squares estimates 
described in the previous chapter. 
The estimate E[Y\A = 1] = + 

6ia IS equal to ' where 

the sum is over all subjects with 
A = a. 



level that she indeed received. The individual-specific IP weights for treat- 
ment A are defined as = l/f{A\L). The denominator f {A\L) of the 
IP weight is the probability of quitting conditional on the measured con- 
founders, Pr = 1\L], for the quitters, and the probability of not quitting 
conditional on the measured confounders, Pr = 0|-L], for the non-quitters. 
For a dichotomous treatment A, we only need to estimate Pr [A = 1\L] because 
Pr [A = 0|L] = l-Pr[A= 1\L]. 

In Section 2.4 we estimated the quantity Pr = 1\L] nonparamctrically: 
we simply counted how many people were treated {A = 1) in each stratum of 
L, and then divided this count by the number of individuals in the stratum. 
All the information required for this calculation was taken from a causally 
structured tree with 4 branches (2 for L times 2 for A). But nonparametric 
estimation of Pr [A = 1\L] is out of the question when, as in our example, we 
have high-dimensional data with many confounders, some of them with multi- 
ple levels. Even if we were willing to recode all 9 confounders except age to a 
maximum of 6 categories each, our tree would still have over 2 million branches. 
And many more millions if we use the actual range of values of duration and 
intensity of smoking, and weight. We cannot obtain meaningful nonparametric 
stratum-specific estimates when there are 1566 individuals distributed across 
millions of strata. We need to resort to modeling. 

To obtain parametric estimates of Pr [^4 = 1\L] in each of the millions of 
strata defined by L, we fit a logistic regression model for the probability of 
quitting smoking with all 9 confounders included as covariates. We used linear 
and quadratic terms for the (quasi-)continuous covariates age, weight, inten- 
sity and duration of smoking, and we included no product terms between the 
covariates. That is, our model restricts the possible values of Pr [A = 1\L] such 
that, on the logit scale, the conditional relation between the continuous covari- 
ates and the risk of quitting can be represented by a parabolic curve, and each 
covariate's contribution to the risk is independent of that of the other covari- 
ates. Under these parametric restrictions, we were able to obtain an estimate 
Pr[A = 1|L] for each combination of L values, and therefore for each of the 
1566 individuals in the study population. 

The next step is computing the difference E[y|A = 1] — E[F|yl = 0] in the 
pseudo-population created by the estimated IP weights. If there is no confound- 
ing for the effect of A in the pseudo-population, association is causation and 
a consistent estimator of the associational difference E[F|A = 1] — E[y|A = 0] 
in the pseudo-population is also a consistent estimator of the causal difference 
E[ya=i] _ E[F'^=0]. To estimate E[Y\A = 1] - F.[Y\A = 0] in the pseudo- 
population, wc fit the (saturated) linear mean model E[y|A] = + OiA by 
weighted least squares, with individuals weighted by their estimated IP weights: 
1/Pr [A = 1\L] for the quitters, and 1/ ^1 - Pr = 1\L]^ for the non-quitters. 

The parameter estimate 9i was 3.4. That is, we estimated that quitting smok- 
ing increases weight by 9i = 3.4 kg on average. 

To obtain a 95% confidence interval around the point estimate 6i = 3.4 
we need a method that takes the IP weighting into account. One possibil- 
ity is to use statistical theory to derive the corresponding variance estimator. 
This approach requires that the data analyst programs the estimator, which 
is not generally available in standard statistical software. A second possibility 
is to approximate the variance by nonparametric bootstrapping (see Techni- 
cal Point 13.1). This approach requires appropriate computing resources, or 
lots of patience, for large databases. A third possibility is to use the robust 
variance estimator (e.g., as used for GEE models with an independent working 
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Technical Point 12.1 



Horvitz-Thompson estimators. In Technical Point 3.1, we defined the "apparent" IP weighted mean for treatment 

I (A = a) y 1 

which is equal to the counterfactual mean E[y] under positivity and exchangeability. This 



IP weighted mean is consistently estimated by the original Horvitz-Thompson (1952) estimator E 



I{A = a)Y' 



/(AIL) ■ 

this chapter, however, we estimated ElY""] via the IP weighted least squares estimate 9o + Oia, which is the modified 

'I{A = a)Y' 



E 



Horvitz-Thompson estimator 



1998). 



fiA\L) 



E 



I{A = a) 



used to estimate the parameters of marginal structural models (Robins 

'I {A = a)Y' 



E 



This modified Horvitz-Thompson estimator is a consistent estimator of 



f{A\L) 



E 



equal to E 



I{A = a) Y 
f{A\L) 



because E 



I{A = a) 



I{A = a) 



which, under positivity, is 



= 1 (though the original and and the modified Horvitz-Thompson 



estimators may still yield different estimates in the sample sizes observed in practice). 
On the other hand, if positivity does not hold, then 



E 



I{A = a)Y 



E 



I{A^a) 



equals 



J2 E [Y\A = a,L = l,L e Q{a)] Pr [L = l\L G Q{a)] and, if exchangeability holds, it equals E e Q{a)] ,where 

I 

Q{a) = {^;Pr {A = a\L = /) > 0} is the set of values I for which A = a may be observed with positive probability. 
Therefore, as discussed in Technical Point 3.1, the difference between modified Horvitz-Thompson estimators with 
a = 1 versus a = 0 does not have a causal interpretation in the absence of positivity. 



E[F|A] = 6*0 + 9iA is a saturated 
model because it has 2 parameters, 
9o and 9i, to estimate two quanti- 
ties, E[Y\A = 1] and E[Y\A = 0]. 
In this model, 9i = E[Y\A = 1] - 
E[y|yl = 0]. 



correlation) that is a standard option in most statistical software packages. 
The 95% confidence intervals based on the robust variance estimator are valid 
but conservative — they cover the super-population parameter more than 95% 
of the time. The conservative 95% confidence interval around 9i was (2.4,4.5). 
In this chapter, all confidence intervals for IP weighted estimates are conserv- 
ative. If the model for Pr [A = 1\L] is misspecified, the estimates of and 6i 
will be biased and, like we discuKscjd in the previous chapter, the confidence 
intervals may cover the true values less than 95% of the time. 



12.3 Stabilized IP weights 

The goal of IP weighting is to create a pseudo-population in which there is 
no association between the covariates L and treatment A. The IP weights 
= l/f{A\L) simulate a pseudo-population in which all members of the 
study population are replaced by two copies of themselves. One copy receives 
treatment value A = 1 and the other copy receives treatment value A = 0. 
In Chapter 2 we showed how the original study population in Figure 2.1 was 
transformed into the pseudo-population in Figure 2.3. The pseudo-population 
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The average causal effect in the 

treated subpopulation can be esti- 
mated by using IP weights in which 
p = Pj:[A = 1\L]. See technical 
Point 4.1. 



was twice as large as the study population because all 20 individuals were 
included both under treatment and under no treatment. Equivalently, the 

expected mean of the weights W"^ was 2. 

The IP weights = 1// (^1-^) adjust for confounding by L because they 
create a pseudo-population in which all individuals have the same probability 
of receiving A = 1 {a. probability equal to 1) and the same probability of 
receiving A = 0 (also 1). Therefore A and L are independent in the pseudo- 
population — all backdoor paths from A to the outcome Y via L are eliminated. 

However, there are other ways to create a pseudo-population in which A and 
L are independent. For example, a pseudo-population in which all individuals 
have a probability of receiving A = 1 equal to 0.5 — rather than 1 — and a 
probability of receiving A = 0 also equal to 0.5, regardless of their values of 
L. Such pseudo-population is constructed by using IP weights 0.5/ f {A\L). 
This pseudo-population would be of the same size as the study population. 
Equivalently, the expected mean of the weights 0.5// {A\L) is 1. 

The effect estimate obtained in the pseudo-population created by weights 
0.5/ f {A\L) is equal to that obtained in the pseudo-population created by 
weights 1// {A\L). (You can check this empirically by using the data in Figure 
2.1, or see the proof in Technical Point 12.2.) The same goes for any other IP 
weights p/f {A\L) with 0 < p < 1. The weights = 1/f {A\L) are just one 
particular example of IP weights with p= 1. 

Let us take our reasoning a step further. The key requirement for confound- 
ing adjustment is that, in the pseudo-population, the probability of treatment 
A does not depend on the confounders L. We can achieve this requirement 
by assigning treatment with the same probability p to everyone in the pseudo- 
population. But we can also achieve it by creating a pseudo-population in 
which different people have different probabilities of treatment, as long as the 
probability of treatment does not depend on the value of L. For example, a 
common choice is to assign to the treated the probability of receiving treatment 
Pr [A = 1] in the original population, and to the untreated the probability of 
not receiving treatment Pr [A = 0] in the original population. Thus the IP 
weights are Pr [A = 1] // (A\L) for the treated and Pr [A = 0] // {A\L) for the 
untreated or, more compactly, / {A) /f{A\L). 

Figure 12.1 shows the pseudo-population created by the weights / (A) / f {A\L) 
applied to the data in Figure 2.1, where Pr[^ = 1] = 13/20 = 0.65 and 
Pr[A = 0] = 7/20 = 0.35. Under the identifiability conditions of Chapter 
3, the pseudo-population resembles a hypothetical randomized experiment in 
which 65% of the individuals in the study population have been randomly as- 
signed to A = 1, and 35% to A = 0. Note that, to preserve the 65/35 ratio, 
the number of individuals in each branch cannot be integers. Fortunately, 
non-whole people are no big deal in mathematics. 

The weights f{A)/f{A\L) range from 0.7 to 1.4, whereas the weights 
l/f[A\L) range from 1.33 to 4. The stabilizing factor f {A) in the numer- 
ator is responsible for the narrower range of the / (A) /f{A\L) weights. The 
IP weights = l/f{A\L) are referred to as nonstabilized weights, and the 
IP weights SW^ = f (A) /f {A\L) are referred to as stabilized weights. The 
mean of the stabilized weights is expected to be 1 because the size of the 
pseudo-population equals that of the study population. 
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In real data analyses one should 
always check that the estimated 
weights SW^ have mean 1. For a 
proof of this result, see Hernan and 
Robins (2004). Deviations from 1 
indicate model misspecification or 
possible violations, or near viola- 
tions, of positivity. 



code: Program 12.3 
The estimated IP weights SW^ 
ranged from 0.33 to 4.30, and their 
mean was 1.00. 



Let us now re-estimate the effect of quitting smoking on body weight by 
using the stabihzed IP weights SW"^. First, we need an estimate of the con- 
ditional probability Pr [A = 1\L] to construct the denominator of the weights. 
We use the^same logistic model we used in Section 12.2 to obtain a parametric 
estimate Pr [A = 1\L] for each of the 1566 individuals in the study population. 
Second, we need to estimate Pr [A = 1] for the numerator of the weights. We 
can obtain a nonparametric estimate by the ratio 403/1566 or, equivalently, 
by fitting a saturated logistic model for Pr [A = 1] with an intercept and no 
covariates. Finally, we estimate the causal difference E[y~-'^] — E['K"=''] by 
fitting the mean model E[y|yl] = 9q + 6iA with individuals weighted by their 
estimated stabilized IP weights: Pr [A — 1] /Pr [A = 1\L] for the quitters, and 

for the non-quitters. Under our assump- 
tions, we estimated that quitting smoking increases weight by 9\ = 3.4 kg (95% 
CI: 2.4. 4.5) on average. This is the same estimate wc obtained earlier using 
the nonstabilized IP weights rather than the stabilized IP weights SW^. 

If nonstabilized and stabilized IP weights result in the same estimate, why 
use stabilized IP weights then? Because stabilized weights typically result in 
narrower 95% confidence intervals than nonstabilized weights. However, the 
statistical superiority of the stabilized weights can only occur when the (IP 
weighted) model is not saturated. In our above example, the two-parameter 
model E[y |A] = 9o + 9iA was saturated because treatment A could only take 
2 possible values. In many settings (e.g., time- varying or continuous treat- 
ments), the weighted model cannot possibly be saturated and therefore stabi- 
lized weights are used. The next section describes the use of stabilized weights 
for a continuous treatment. 
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Fine Point 12.2 

Checking positivity. In our study, there are 4 white women aged 66 years and none of them quit smoking. That is, the 
probability of A = 1 conditional on (a subset of) L is 0. Positivity, a condition for IP weighting, is empirically violated. 
There are two possible ways in which positivity can be violated: 

• Structural violations: The type of violations described in Chapter 3. Individuals with certain values of L cannot 
possibly be treated (or untreated). An example: when estimating the effect of exposure to certain chemicals on 
mortality, being off work is an important confounder because people off work are more likely to be sick and to die, 
and a determinant of chemical exposure — people can only be exposed to the chemical while at work. That is, the 
structure of the problem guarantees that the probability of treatment conditional on being off work is exactly 0 
(a structural zero). We'll always find zero cells when conditioning on that confounder. 

• Random violations: The type of violations described in the first paragraph of this Fine Point. Our sample is finite 
so, if we stratify on several confounders, we will start finding zero cells at some places even if the probability 
of treatment is not really zero in the target population. This is a random, not structural, violation of positivity 
because the zeroes appear randomly at different places in different samples of the target population. An example: 
our study happened to include 0 treated individuals in the strata "white women age 66" and "white women age 
67", but it included a positive number of treated individuals in the strata "white women age 65" and "white 
women age 69." 

Each type of positivity violation has different consequences. In the presence of structural violations, causal inferences 
cannot be made about the entire population using IP weighting or standardization. The inference needs to be restricted 
to strata in which structural positivity holds. See Technical Point 12.1 for details. In the presence of random violations, 
we used our parametric model to estimate the probability of treatment in the strata with random zeroes using data 
from individuals in the other strata. In other words, we use parametric models to smooth over the zeroes. For example, 
the logistic model used in Section 12.2 estimated the probability of quitting in white women aged 66 by interpolating 
from all other individuals in the study. Every time we use parametric estimation of IP weights in the presence of zero 
cells — like we did in estimating 6i = 3.4 — , we are effectively assuming random nonpositivity. 



12.4 Marginal structural models 



A (saturated) marginal structural 
mean model for a dichotomous 
treatment A. 



Consider the following linear model for the average outcome under treatment 
level a 

E[F»] =/3o+^ia 

This model is different from all models we have considered so far: the out- 
come variable of this model is counterfactual — and hence generally unobserved. 
Therefore the model cannot be fit to the data of any real-world study. Models 
for mean counterfactual outcomes are referred to as structural mean models. 
When, as in this case, the structural mean model does not include any covari- 
ates we refer to it as an unconditional or marginal structural mean model. 

The parameters for treatment in structural models correspond to average 
causal effects. In the above model, the parameter /3i is equal to E[y°"^] — 
£[^^=0] because E[y"] = jSo under a = 0 and E[F«] = /3o + A under a = 1. 
In previous sections, we have estimated the average causal effect of smoking 
cessation A on weight change Y defined as E[y"^] — E[y*"°]. In other words, 
we have estimated the parameter /3i of a marginal structural model. 

Specifically, we used IP weighting to construct a pseudo-population, and 
then fit the model E[y|A] = + 0iA to the pseudo-population data by using 
IP-weighted least squares. Under our assumptions, association is causation 
in the pseudo-population. That is, the parameter 6i from the IP-weighted 
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A desirable property of marginal 

structural models is null preserva- 
tion (see Chapter 9): when the null 
hypothesis of no average causal ef- 
fect is true, a marginal structural 
model is never misspecified. For 
example, under marginal structural 
model ElY"] = A) + + f^2a^, 
a Wald test on two degrees of free- 
dom of the joint hypothesis Pi = 
/32 = 0 is a valid test of the null 
hypothesis. 



A (nonsaturated) marginal struc- 
tural mean model for a continuous 
treatment A. 



code: Program 12.4 
The estimated SW^ ranged from 
0.19 to 5.10 with mean 1.00. We 
assumed constant variance (ho- 
moscedasticity), which seemed rea- 
sonable after inspecting a residuals 
plot. Other choices of distribution 
(e.g., truncated normal with het- 
eroscedasticity) resulted in similar 
estimates. 



associational model E[F|A] = Oq + OiA can be endowed with the same causal 

interpretation as the parameter (3i from the structural model EfF"] = /3o + 
/3i. It follows that a consistent estimate 9i of the associational parameter 
in the pseudo-population is also a consistent estimator of the causal effect 
/3i = E[F"=i] - E[F°=0] in the population. 

The marginal structural model E[Y°-] = po+pi is saturated because smok- 
ing cessation ^ is a dichotomous treatment. That is, the model has 2 unknowns 
on both sides of the equation: E[F°^^] and E[y°=''] on the left-hand side, and 
/3o and (3i on the right-hand side. Thus sample averages computed in the 
pseudo-population were enough to estimate the causal effect of interest. 

But treatments arc often polytomous or contirnious. For example, consider 
the new treatment A "change in smoking intensity" defined as number of ciga- 
rettes smoked per day in 1982 minus number of cigarettes smoked per day at 
baseline. Treatment A can now take many values, e.g., —25 if an individual 
decreased his number of daily cigarettes by 25, 40 if an individual increased 
his number of daily cigarettes by 40. Let us say that we are interested in 
estimating the difference in average weight change under different changes in 
treatment intensity in the 1162 individuals who smoked 25 or fewer cigarettes 
per day at baseline. That is, we want to estimate E[y*] — E[y" ] for any values 
a and a' . 

Because treatment A can take dozens of values, a saturated model with 
as many parameters becomes impractical. We will have to consider a non- 
satTirated structural model to specify the dosc-rcsporise curve for the effect of 
treatment A on the mean outcome Y. If we believe that a parabola appropri- 
ately describes the dose-response curve, then we would propose the marginal 
structural model 

E[Y'']=(3o+Pia + p2a^ 

where = a x a is o-squared and E[y°^'^] = /3o is the average weight gain 
under a = 0, i.e., under no change in smoking intensity between baseline and 
1982. 

Suppose we want to estimate the average causal effect of increasing smoking 
intensity by 20 cigarettes per day compared with no change, i.e., E[y=^°] — 
E[ya=0] According to our structural model, E[F"=20] = + 20^i + 400^2, 
and thus E[y°=20] - E[F"=0] = 20/3i + 400/^2. Now we need to estimate the 
parameters /3i and p2- To do so, we need to estimate IP weights SW^ to 
create a pseudo-population in which there is no confounding by L, and then 
fit the associational model E[y|j4] = 6o + 0iA + 62A'^ to the pseudo-population 
data. 

To estimate the stabilized weights SW"^ = f (A) /f {A\L) we need to es- 
timate f{A\L). For a dichotomous treatment A, f {A\L) was a probability 
and we used a logistic model to estimate Pr [A = l\L]. For a continuous treat- 
ment A, f {A\L) is a probability density function (pdf). Unfortunately, pdfs 
arc generally hard to estimate correctly, which is why using IP weighting for 
continuous treatments will often be dangerous. In our example, we assumed 
that the density / {A\L) was normal (Gaussian) with mean = E[A|L] and 
variance ct^. We then used a linear regression model to estimate the mean 
E[A|iy] and variance of residuals for all combinations of values of L. We 
also assumed that the density / (A) in the numerator was normal. One should 
be careful when using IP weighting for continuous treatments because the ef- 
fect estimates may be exquisitely sensitive to the choice of the model for the 
conditional density / (A|L). 

Our IP weighted estimates of the parameters of the marginal structural 
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This is a saturated marginal struc- 
tural logistic model for a dichoto- 
mous treatment. For a continuous 
treatment, we would specify a non- 
saturated logistic model. 



code: Program 12.5 



model were $o = 2.00, $i = -0.11, and $2 = 0.003. According to these 

estimates, the mean weight gain (95% CI) would have been 2.0 kg (1.4, 2.6) if 
all individuals have kept their smoking intensity constant, and —0.2 kg (—1.5, 
1.1) if all individuals had increased smoking by 20 cigarettes/day between 
baseline and 1982. 

One can also consider a marginal structural model for a dichotomous out- 
come. For example, if interested in the causal effect of quitting smoking A (1: 
yes, 0: no) on the risk of death D (1: yes, 0: no) by 1992, one could consider 

a marginal structural logistic model like 

logitPr[D" = 1] =ao+aia 
where exp (ai) is the causal odds ratio of death for quitting versus not quitting 

smoking. The parameters of this model are consistently estimated, under our 

assumptions, by fitting the logistic model logitPr[_D = 1|^] = + to 
the pseudo-population created by IP weighting. We estimated the causal odds 
ratio (95% CI) to be exp = 1.0 (0.8, 1.4). 



12.5 Effect modification and marginal structural models 



Also note that the parameter /Ja 
does not generally have a causal 
interpretation. Remember that we 
are assuming exchangeability, posi- 
tivity and well-defined interventions 
for treatment A, not for sex V\ 



Marginal structural models do not include covariates when the target parame- 
ter is the average causal effect in th(^ population. However, one may include 
covariates in a marginal structural model to assess effect modification. Sup- 
pose it is hypothesized that the effect of smoking cessation varies by sex V {1: 
woman, 0: man). To examine this hypothesis, we add the covariate V to our 
marginal structural mean model: 

E [Y^IV] =00+ Pia + p2Va + p^V 

Additive effect modification is present if /32 ^ 0. Technically, this is not a mar- 
ginal model any more — because it is conditional on V — but the term "marginal 
structural model" is still applied. 

Wc; can estimate the model parameters by fitting the linear regression model 
E \Y\A, V] = 7o+7i^+ 72^^^+731^ via weighted least squares with IP weights 
W'^ or SW'^. The vector of covariates L needs to include V and any other 
variables that are needed to ensure exchangeability within levels of V. 

Because we are considering a model for the effect of treatment within levels 
of V, we now have the choice to use either / [A\ or / [A\V\ in the numerar 
tor of the stabilized weights. IP weighting based on the stabilized weights 



SW"^ {V) 



generally results in a narrower confidence intervals around 



the effect estimates. Some intuition for the increased statistical efficiency of 
SW^ iy): with V in the conditioning event of both the numerator and the 
denominator, the numerical value of numerator and denominator gets closer, 
which results in added stabilization for (less variability in) the IP weights and 
therefore narrower 95% confidence intervals. We estimate SW^ {V) using the 
same approach as for SW^, except that we add the covariate V to the logistic 
model for the numerator of the weights. 

The particular subset V oi L that an investigator chooses to include the 
marginal structural model should only reflect the investigator's substantive in- 
terest. For example, a variable V should be included in the marginal structural 
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model only if the investigator both believes that V may be an effect modifier 
and has greater substantive interest in the causal effect of treatment within 
levels of the covariate V than in the entire population. In our example, we 
found no strong evidence of effect modification by sex as the 95% confidence 
CODE: Program 12.6 interval around the parameter estimate 72 was (—2.2, 1.9). If the investigator 

chooses to include all variables L in the marginal structural model, then the 
stabilized weights SW^ (L) equal 1 and no IP weighting is necessary because 
the (unweighted) outcome regression model, if correctly specified, fully adjusts 
for all confounding by L (see Chapter 15). For this reason, in a slightly hu- 
morous vein, we refer to a marginal structural model that conditions on all 
variables L needed for exchangeability as a faux marginal structural m.ode2. 

In Part I we discussed that effect modification and confounding are two 
logically distinct concepts. Nonetheless, many students have difficulty under- 
standing the distinction because the same statistical methods — stratification 
(Chapter 4) or regression (Chapter 15) — are often used both for confounder ad- 
justment and detection of effect modification. Thus, there may be some advan- 
tage to teaching these concepts using marginal structural models, because then 
methods for confounder adjustment (IP weighting) are distinct from methods 
for detection of effect modification (adding treatment-covariate product terms 
to a marginal structural model). 



12.6 Censoring and missing data 

When estimating the causal effect of smoking cessation A on weight gain K, 
we restricted the analysis to the 1566 individuals with a body weight mea- 
surement at the end of follow-up in 1982. There were, however, 63 additional 
individuals who met our eligibility criteria but were excluded from the analysis 
because their weight in 1982 was not known. Selecting only individuals with 
nonmissing outcome values — that is, censoring from the analysis those with 
missing values — may introduce selection bias, as discussed in Chapter 8. 

Let censoring C be an indicator for measurement of body weight in 1982: 
1 if body weight is unmeasured (i.e., the individual is censored), and 0 if 
body weight is measured (i.e., the individual is unccnsorcd). Our analysis 
was necessarily restricted to uncensored individuals, i.e., those with C = 0, 
because those were the only ones with known values of the outcome Y. That 
is, in sections 12.2 and 12.4 we did not fit the (weighted) outcome regression 
model E[y|A] = 6*0 + 0iA, but rather the model E[F|A, C = Q] = Oo + O^A 
restricted to individuals with C = 0. 

Unfortunately, as described in Chapter 8, selecting only uncensored indi- 
viduals for the analysis is expected to induce bias when C is either a collider 
on a pathway between treatment A and the outcome Y, or the descendant of 
one such collider. See the causal diagrams in Figures 8.3 to 8.6. Our data are 
consistent with the structure depicted by those causal diagrams: treatment A 
is associated with censoring C — 5.8% of quitters versus 3.2% nonquitters were 
censored — and at least some predictors of Y are associated with C — the aver- 
age baseline weight was 76.6 kg in the censored versus 70.8 in the uncensored. 

When censoring due to loss to follow-up can introduce selection bias, we 
turn our attention to the causal effect if nobody in the study population had 
been censored. In our example, the goal becomes estimating the mean weight 
gain if everybody had quit smoking and nobody's outcome had been censored, 
gjya=i,c=oj^ and the mean weight gain if nobody had quit smoking and no- 
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The IP weights for censoring 
and treatment are W^''^' = 
1/f {A, C = 0|L), where the joint 
density of A and C is factored 

as f{A,C^O\L) = f{A\L) x 
Pr [C = 0\L,A]. 

Some variables in L may have 

zero coefficients in the model for 
f {A\L) but not in the model 
for Pr [C = 0|-L, or vice versa. 
Nonetheless, in large samples, it is 
always more efficient to keep all 
variables L that independenty pre- 
dict the outcome in both models. 



The estimated IP weights SW^ 
have mean 1 when the model for 
Pr [C = 0\A] is correctly specified. 



code: Program 12.7 
The estimated IP weights SW'^^'-^ 
ranged from 0.35 to 4.09, and their 
mean was 1.00. 



body's outcome had been censored E[y"=''''^=°] . Then the causal effect of 
interest is E[y°=^'^="] — E[F°"'^'°=''], a joint effect of A and C as we discussed 
in Chapter 8. The use of the superscript c = 0 makes it exphcit the causal con- 
trast that many have in mind when they refer to the causal effect of treatment 
A, even if they choose not to use the superscript c = 0. 

This causal effect can be estimated by using IP weights W^'^' = x W'~' 
under exchangeability for the joint treatment {A, C) conditional on L, that is, 
ya=i,c=o ]j ii^^ (j^ j£ some of the variables in L are affected by treatment 
A, e.g., as in Figure 8.4, this conditional independence will not generally hold. 
In Part III we show that there are alternative exchangeability conditions that 
license us to use IP weighting to estimate the joint effect of A and C when 
some components of L are affected by treatment. 

Remember that the weights W'~" = 1/Pr [C = 0|i, j4] create a pseudo- 
population with the same size as that of the original study population before 
censoring, and in which there is no arrow from either L or A into C. In our 
example, the estimates of IP weights for censoring W'^ will create a pseudo- 
population with (approximately) 1566 + 63 = 1629 in which the 63 censored 
individuals arc replaced by copies of uncensored individuals with the same 
values of treatment A and covariates L. That is, we fit the weighted model 
E[y|j4, C = 0] = + 9iA with weights W^''^ to estimate the parameters of 
the marginal structural model E[F'^''=='^] — /Jq + Pia. 



Alternatively, one can use stabilized IP weights SW 



A,C 



The censoring weights SW'^ = Pr [C = 0\A] / Pr [C = 0\L, A] create a pseudo- 
population of the same size as the original study population after censoring, 
and in which there is no arrows from L into C. In our example, the estimates 
of IP weights for censoring SW*^' will create a pseudo-population of (approxi- 
mately) 1566 uncensored individuals who have the same distribution of covari- 
ates L as the 63 censored individuals not included in the pseudo-population. 
The stabilized weights do not eliminate censoring in the pseudo-population, 
they make censoring occur at random with respect to the measured covariates 
L. Therefore, under the assumption of conditional exchangeability of censored 
and uncensored individuals given L (and A), the proportion of censored indi- 
viduals in the pseudo-population is identical to that in the study population. 
That is, there is selection but no selection bias. 

To obtain parametric estimates of Pr [C = 0\L, A] in our example, we fit a 
logistic regression model for the probability of being uncensored to the 1629 
individuals in the study population. The model included the same covariates 
we used earlier to estimate the weights for treatment. Under these paramet- 
ric restrictions, we obtained an estimate Pr [C = 0|L, A] and an estimate of 
SW^ for each of the 1566 uncensored individuals. Using the stabilized weights 
^^A,c _ SW^ X SW'--'' we cstimat(xl that quitting smoking increases weight 
by Oi ~ 3.5 kg (95% CI: 2.5, 4.5) on average. This is almost the same estimate 
we obtained earlier using IP weights SW"^, which suggests that either there is 
no selection bias by censoring or that our measured covariates are unable to 
eliminate it. 

We now describe an alternative to IP weighting to adjust for confounding 
and selection bias: standardization. 
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Technical Point 12.2 



More on stabilized weights. The stabilized weights SW^ = 



f[A] 

fm 



are part of the larger class of stabilized weights 



where g [A] is any function of A that is not a function of L. When unsaturated structural models are used, 



9 [A] 

f[A\Ly 

weights ^j^l^^ .....^j^i^^ 

to construct more efficient estimators of the causal effect in a nonsaturated marginal structural model. We now show 



9 [A] 



are preferable over weights 



because there exist functions g [A] (often / [A]) that can be used 



that the IP weighted mean with weights 



First note that the IP weighted mean E 

"/(A = a)F 



l^j. is equal to the counterfactual mean E [F"]. 
I (A — alYl 

using weights 1// [A\L], which is equal to E [¥"■], can also 



E 



be expressed as 



fiA\L) 



E 



can be expressed as 



I{A = a) 

f{A\L) 
I{A^a)Y 

fiA\L) 



because E 



fiA\L) 

IjA^a) 

Jim 



= 1. Similarly, the IP weighted mean using weights 



•[A] 



f[A\L] 



E 



9{A) 



E 



I{A = a] 



viA) 



to show that the numerator E 



f{A\L) 

I{A = a)Y 



which is also equal to E [Y"']. The proof proceeds as in Technical Point 2.2 

I{A = a) 



fiA\L) 



9(A) 



= E [F"] g{a), and that the denominator E 



fiA\L) 



-9{A) 



= 9{a)- 



Chapter 13 

STANDARDIZATION AND THE PARAMETRIC G-FORMULA 



In this chapter we describe how to use standardization to estimate the average causal effect of smoking cessation 
on body weight gain. We use the sariK; observational data set as in the previous chapter. Though standardization 
was introduced in Chapter 2, we only described it as a nonparametric method. We now describe the use of models 
together with standardization, which will allow us to tackle high-dimensional problems with many covariates and 
nondichotomous treatments. We provide computer code to conduct the analyses. 

In practice, investigators will often have a choice between IP weighting and standardization as the analytic 
approach to obtain effect estimates from observational data. Both methods are based on the same identifiability 
conditions, but on different modeling assumptions. 



13.1 Standardization as an alternative to IP weighting 



As in the previous chapter, we will 
assume that the components of L 
required to adjust for C are unaf- 
fected by A. Otherwise, we would 
need to use the more general ap- 
proach described in Part III. 



In the previous chapter we estimated the average causal effect of smoking ces- 
sation A (1: yes, 0: no) on weight gain Y (measured in kg) using IP weighting. 
In this chapter we will estimate the same effect using standardization. Our 
analyses will also be based on NHEFS data from 1629 cigarette smokers aged 
25-74 years who had a baseline visit and a follow-up visit about 10 years later. 
Of these, 1566 individuals had their weight measured at the follow-up visit and 
are therefore uncensored (C = 0). 

We define E[Y'"''^"''] as the mean weight gain that would have been observed 
if all subjects had received treatment level a and if no subjects had been cen- 
sored. The average causal effect of smoking cessation can be expressed as the 
difference E[y=^''^='^] — E[y"''''^"'^], that is, the difference in mean weight 
that would have been observed if everybody had been treated and uncensored 
compared with untreated and uncensored. 

As shown in Table 12.1, quitters {A = 1) and non-quitters [A = 0) differ 
with respect to the distribution of predictors of weight gain. The observed 
associational difference E[F|A = 1,C = 0] - E[F|A = 0, C = 0] = 2.5 is 
expected to differ from the causal difference E[F"=^''^"''] — E[F"=''''^"'^]. Again 
we assume that the vector of variables L is sufficient to adjust for confounding 
and selection bias, and that L includes the baseline variables sex (0: male, 
1: female), age (in years), race (0: white, 1: other), education (5 categories), 
intensity and duration of smoking (number of cigarettes per day and years of 
smoking), physical activity in daily life (3 categories), recreational exercise (3 
categories), and weight (in kg). 

One way to adjust for the variables L is IP weighting, which creates a 
pseudo-population in which the distribution of the variables in L is the same 
in the treated and in the untreated. Then, under the assumptions of exchange- 
ability and positivity given L, we estimate E[y''==''] by simply computing 
E[y|A = a, C = 0] as the average outcome in the pseudo-population. If A 
were a continuous treatment (contrary to our example), we would also need a 
structural model to estimate E[y|A, C = 0] in the pseudo-population for all 
possible values of A. IP weighting requires estimating the joint distribution of 
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Technical Point 2.3 proves that, un- 
der conditional exchangeability and 
positivity, the standardized mean 
in the treated equals the mean if 
everyone had been treated. The ex- 
tension to censoring is trivial: just 
replace A = a by {A = a,C = 0) 
In the proof and definitions. 



The average causal effect in the 
treated can be estimated by stan- 
dardization as described in Techni- 
cal Point 4.1. One just needs to 
replace Pr[L = I] by Pr[L = l\A = 
1] In the expression to the right. 



treatment and censoring. For the dichotomous treatment smoking cessation, 
we estimated Pi [A — a,C = 0\L] and computed IP probability weights with 
this joint probabiUty in the denominator. 

As discussed in Chapter 2, an alternative to IP weighting is standardiza- 
tion. Under exchangeability and positivity conditional on the variables in L, 
the standardized mean outcome in the uncensored treated is a consistent es- 
timator of the mean outcome if everyone had been treated and had remained 
uncensored E[y^^''^^*']. Analogously, the standardized mean outcome in the 
uncensored untreated is a consistent estimator of the mean outcome if everyone 
had been untreated and had remained uncensored EfF'*^''''^^^]. 

To compute the standardized mean outcome in the uncensored treated, we 
first need to compute the mean outcomes in the uncensored treated in each stra- 
tum I of the confounders L, i.e., the conditional means E[y| A = 1,C = 0,L = I] 
in each of the strata I. In our smoking cessation example, we would need to 
compute the mean weight gain Y among those who quit smoking and remained 
uncensored in each of the (possibly millions of) strata defined by the combina- 
tion of values of the 9 variables in L. The standardized mean in the uncensored 
treated is then the weighted average of these conditional means using as weights 
the prevalence of each value / in the study population, i.e., Pr [L ^ I]. That 
is, the conditional mean from the stratum with the greatest number of indi- 
viduals has the greatest weight in the computation of the standardized mean. 
The standardized mean in the uncensored untreated is computed analogously 
except that the A = 1 in the conditioning event is replaced hy A = 0. 

More compactly, the standardized mean in the uncensored who received 
treatment level a is 

^E[Y\A = a,C = Q,L = l] xPt[L = 1] 

I 

When, as in our example, some of the variables in L are continuous, one needs 
to replace Pr [L = I] by the probability density function (pdf) /l [I], and the 
above sum becomes an integral. 

The next two sections describe how to estimate the conditional means of 
the outcome Y and the distribution of the confounders L, the two types of 
quantities required to estimate the standardized mean. 



13.2 Estimating the mean outcome via modeling 

Ideally, we would estimate the set of conditional means E[y|A = 1, (7 = 0, L = 
I] nonpararrictrically. Wc woiild compute the average outcome among the un- 
censored treated in each of the strata defined by different combination of values 
of the variables L. This is precisely what we did in Section 2.3, where all the 
information required for this calculation was taken from Table 2.2. 

But nonparametric estimation of E[y|A = 1,C = 0,L = I] is out of the 
question when, as in our current example, we have high-dimensional data with 
many confounders, some of them with multiple levels. We cannot obtain mean- 
ingful nonparametric stratum-specific estimates of the mean outcome in the 
treated when there are only 403 treated individuals distributed across millions 
of strata. We need to resort to modeling. The same rationale applies to the con- 
ditional mean outcome in the uncensored untreated E[y|A = 0,C = 0,L = I]. 

To obtain parametric estimates of E[y|^ = a, C = 0,L = I] in each of the 
millions of strata defined by L, we fit a linear regression model for the mean 
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Fine Point 13.1 

Structural positivity. Lack of structural positivity precludes the unbiased estimation of the average causal effect 
in the entire population when using IP weighting. Positivity is also necessary for standardization because, when 
Pr [A = a\L = = 0 and Pr [L = I] ^ 0, then the conditional mean outcome E[y| A = a,L = I] is undefined. 

But the practical impact of deviations from positivity may vary greatly between IP weighted and standardized 
estimates that rely on parametric models. When using standardization, one can ignore the lack of positivity if one 
is willing to rely on parametric extrapolation. That is, one can fit a model for E[y|A, L] that will smooth over the 
strata with structural zeroes. This smoothing will introduce bias into the estimation, and therefore the nominal 95% 
confidence intervals around the estimates will cover the true effect less than 95% of the time. In general, in the presence 
of violations or near-violations of positivity, the standard error of the treatment effect will be smaller for standardization 
than for IP weighting. This does not necessarily means that standardization is preferred over IP weighting; the difference 
in the biases may swamp the differences in standard errors. 



weight gain with treatment A and all 9 confounders in L included as covariates. 
Wc used linear and quadratie terms for the (quasi-) continuous covariates age, 
weight, intensity and duration of smoking. That is, our model restricts the 
possible values of E[y|A = a,C = 0, L — I] such that the conditional relation 
between the continuous covariates and the mean outcome can be represented 
by a parabolic curve. We included a product term between smoking cessation 
A and intensity of smoking. That is, our model imposes the restriction that 
each covariate's contribution to the mean is independent of that of the other 
covariates, except that the contribution of smoking cessation A varies linearly 
CODE: Program 13.1 with intensity of prior smoking. 



Under these parametric restrictions, we obtained an estimate E[F|A = 
a,C = 0,L = I] for each combination of values of A and L, and therefore 
for each of the 403 uncensored treated (A = 1,C = 0) and each of the 1163 
uncensored untreated (^4 = 0, C = 0) individuals in the study population. 
For example, we estimated that subjects with the combination of values {non- 
quitter, male, white, age 26, college dropout, 15 cigarettes/day, 12 years of 
smoking habit, moderate exercise, very active, weight 112 kg} had a mean 
weight gain of 0.34 kg (the subject with unique identifier 24770 happened to 
have these combination of values, you may take a look at his predicted value). 
In general, the standardized mean Overall, the mean of the estimated weight gain was 2.6 kg, same as the mean 
of Y is written as of the observed weight gain, and ranged from —10.9 to 9.9 kg across different 

/ E [Y\A — a,C ~ 0, L — I] dF^ {I) combinations of covariates. 

Remember that our goal is to estimate the standardized mean XI; = 
a,C = 0, L = l]x Pr [L = I] in the treated (A — 1) and in the untreated (^4 = 0). 
More formally, the standardized mean should be written as an integral because 
some of the variables in L are essentially continuous, and thus their distribution 
cannot be represented by a probability function. Regardless of these notational 
issues, we have already estimated the means E[y|A = a,C = 0,L = I] for all 
values of treatment A and confounders L. The next step is standardizing these 
means to the distribution of the confounders L for all values I. 



where Fl (•) is the joint cumulative 
distribution function (cdf) of the 
random variables in L. When, as in 
this chapter, L is a vector of base- 
line covariates unaffected by treat- 
ment, we can average over the ob- 
served values of L to nonparamet- 
rically estimate this integral. 
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13.3 Standardizing the mean outcome to the confounder distribution 
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The standardized mean is a weighted average of the conditional means E[y|A = 
a,C = 0,L = I]. When all variables in L are discrete, each mean receives a 
weight equal to the proportion of subjects with values L = I, i.e., Pr [L = I]. In 
principle, these proportions Pr [L = I] could be calculated nonparametrically 
from the data: we would divide the number of subjects in the strata defined by 
L = ^ by the total number of subjects in the population. This is precisely what 
we did in Section 2.3, where all the information required for this calculation 
was taken from Table 2.2. However, this method becomes tedious for high- 
dimensional data with many confounders, some of them with multiple levels, 
as in our smoking cessation example. 

We now describe a faster, but mathematically equivalent, method to stan- 
dardize means. Wc first apply the method to the data in Table 2.2, in which 
there was no censoring, the confounder L is only one variable with two levels, 
and y is a dichotomous outcome, i.e., the mean E[y |^ = a,L = I] is the risk 
Pr[y = Ij^l = a, L = Z] of developing the outcome. The goal is to estimate the 
standardized means J2i E[F|A = a, L = I] x Pr [L = I] in the treated {A — 1) 
and in the untreated {A = 0). The method has 4 steps: expansion of dataset, 
outcome modeling, prediction, and standardization by averaging. 

Table 2.2 has 20 rows, one per study subject. We now create a new dataset 
in which the data of Table 2.2 is copied three times. That is, the analytic 
dataset has 60 rows in three blocks of 20 individuals each. We leave the first 
block of 20 rows as is, i.e., the first block is identical to the data in Table 2.2. 
We modify the data of the second and third blocks as shown in the margin. In 
the second block, wc set the value of A to 0 (untreated) for all 20 subjects; in 
the third block we set the value of A to 1 (treated) for all subjects. In both the 
second and third blocks, we delete the data on the outcome for all subjects, 
i.e., the variable Y is assigned a missing value. As described below, we will use 
the second block to estimate the standardized mean in the untreated and the 
third block for the standardized mean in the treated. 

Next we use the 3-block dataset to fit a regression model for the mean 
outcome given treatment A and the confounder L. We add a product term 
A X L to make the model saturated. Note that only the subjects in the first 
block of the dataset (the actual data) will contribute to the estimation of the 
parameters of the model because the outcome is missing for all subjects in the 
second and third blocks. 

The next step is to use the parameter estimates from the first block to 
predict the outcome values for all rows in the second and third blocks. (That 
is, we combine data on L and A with the regression estimates to impute the 
missing value for the outcome Y.) The predicted outcome values for the second 
block are the estimates of the mean outcome for each of the combinations of 
values of L and ^ = 0, and the predicted values for the third block are the 
estimates of the mean outcome for all combinations of values of L and A = 1. 

Finally, we compute the average of all predicted values in the second block. 
Because 60% of rows have value L = 1 and 40% have value L = 0, this average 
gives more weight to rows with L = 1. That is, the average of all predicted 
values in the second block is precisely the standardized mean outcome in the 
untreated. We are done. To estimate the standardized mean outcome in the 
treated, we compute the average of all predicted values in the third block. 

The above procedure yields exactly the same estimates of the standardized 
means (0.5 for both of them) as the direct calculation in Section 2.3. Both 
approaches are completely nonpar ametric. In this chapter we did not directly 
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Technical Point 13.1 



Bootstrapping. Effect estimates are presented with measures of random variability, such as the standard error or the 
95% confidence interval, which is a function of the standard error. (We discussed the foundations of variability in Chapter 
10.) Because of the computational difficulty to obtain exact estimates, in practice standard error estimates are often 
based on large-sample approximations, which rely on asymptotic considerations. However, sometimes even large-sample 
approximations are too complicated to be calculated. The bootstrap is an alternative method for estimating standard 
errors and computing 95% confidence intervals. The simplest version of the bootstrap, which we used to compute the 
95% confidence interval around the effect estimate of smoking cessation, is sketched below. 

Take the study population of 1629 individuals. Sample with replacement 1629 individuals from the study population, 
so that some of the original individuals may appear more than once while others may not be included at all. This new 
sample of size 1629 is referred to as a "bootstrap sample." Compute the effect of interest in the bootstrap sample (e.g., 
by using standardization as described in the main text). Now create a second bootstrap sample by again sampling with 
replacement 1629 individuals. Compute the effect of interest in the second bootstrap sample using the same method 
as for the first bootstrap sample. By chance, the first and second bootstrap sample will generally include a different 
number of copies of each individual, and therefore will result in different effect estimates. Repeat the procedure in a 
large number (say, 1000) of bootstrap samples. It turns out that the standard deviation of the 1000 effect estimates in 
the bootstrap samples consistently estimates the standard error of the effect estimate in the study population. The 95% 
confidence interval is then computed by using the usual normal approximation: ±1.96 times the estimate of the standard 
error. See, for example, Wasserman (2004) for an introduction to the statistical theory underlying the bootstrap. 

We used this bootstrap method with 1000 bootstrap samples to obtain the 95% confidence interval described in 
the main text for the standardized mean difference. Though the bootstrap is a simple method, it can be computationally 
intensive for very large datasets. It is therefore common to see published estimates that are based on only 200-500 
bootstrap samples (which would have resulted in an almost identical confidence interval in our example). Finally, note 
that the bootstrap is a general method for large samples. We could have also used it to compute a 95% confidence 
interval for the IP weighted estimates from marginal structural models in the previous chapter. 



estimate the distribution of L, but rather average over the observed values of 
L, i.e., its empirical distribution. 



CODE: Program 13.3 



The use of the empirical distribution for standardizing is the way to go in 

more realistic examples, like our smoking cessation study, with high-dimensional 
L. The procedure for our study is the one described above for the data in Ta- 
ble 2.2. We add the second and third blocks to the dataset, fit the regression 
model for E[y|A = a,C = 0,L ^ I] as described in the previous section, 
and generate the predicted values. The average predicted value in the second 
block — the standardized mean in the untreated — ^was 1.65, and the average pre- 
dicted value in the third block — the standardized mean in the treated — was 



CODE: Program 13.4 



5.11. Therefore, our estimate of the causal effect E[y"=i''==°] - Efy^^o.c^Oj 
was 5.11 — 1.65 = 3.5 kg. To obtain a 95% confidence interval for this estimate 
we used a statistical technique known as the bootstrap (see Technical Point 
13.1). In summary, we estimated that quitting smoking increases body weight 
by 3.5 kg (95% CI: 2.6, 4.4). 



13.4 IP weighting or standardization? 



We have now described two ways in which modeling can be used to estimate 
the average causal effect of a treatment: IP weighting (previous chapter) and 
standardization (this chapter). In our smoking cessation example, both yielded 
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Robins (1986) described the gen- 
eralization of standardization to 
time-varying treatments and con- 
founders, and named it the g- 
computation algorithm formula, 
aka, the g-formula. 



almost exactly the same effect estimate. Indeed Technical Point 2.3 proved that 

the standardized mean equals the IP weighted mean. 

Why are we then bothering to estimate the standardized mean in this chap- 
ter if we had already estimated the IP weighted mean in the previous chapter? 
It turns out that the IP weighted and the standardized mean are only ex- 
actly equal when no models are used to estimate them. Otherwise they are 
expected to differ. To see this, consider the quantities that need to be mod- 
eled to implement either IP weighting or standardization. IP weighting mod- 
els Pr[A = a,C = 0|L], which we estimated in the previous chapter by fitting 
parametric logistic regression models for Pr [A = a\L] and Pr [C = 0\A = a,L]. 
Standardization models the conditional means E[y|^ = a,C = 0, L = I], which 
we estimated in this chapter using a parametric linear regression model. 

In practice some degree of misspecification is inescapable in all models, and 
model misspecification will introduce some bias. But the misspecification of 
the treatment model (IP weighting) and the outcome model (standardization) 
will not generally result in the same magnitude and direction of bias in the ef- 
fect estimate. Therefore the IP weighted estimate will generally differ from the 
standardized estimate because unavoidable model misspecification will affect 
the point estimates differently. Large differences between the IP weighted and 
standardized estimate will alert us to the presence of serious model misspec- 
ification in at least one of the estimates. Small differences do not guarantee 
absence of serious model misspecification, but will be reassuring — though logi- 
cally possible, it is unlikely that badly misspecified models resulting in bias of 
similar magnitude and direction for both methods. 

In our smoking cessation example, both the IP weighted and the standard- 
ized estimates arc similar. After rounding to one decimal plac(\ the estimated 
weight gain due to smoking cessation was 3.5 kg regardless of whether we fit a 
model for treatment A (IP weighting) or for the outcome Y (standardization). 
Note that in neither case we fit a model for the confoimders L, as we did not 
need the distribution of the confounders to obtain the IP weighted estimate, 
and we just used the empirical distribution of L (a nonparametric method) to 
compute the standardized estimate. 

Computing the standardized mean outcome with parametrically estimated 
conditional means is a particular case of the parametric g-formula. Because we 
were only interested in the average causal effect, we only had to estimate the 
conditional mean outcome. More generally, the parametric g-formula uses esti- 
mates of any functions of the distribution of the outcome (e.g., functionals like 
the probability density function or PDF) within levels of A and L to compute 
its standardized value. In the absence of time-varying confounders (see Part 
III), as in our example, the parametric g-formula does not require parametric 
modeling of the distribution of the confoimders. 

We used standardization to estimate the average causal effect in the entire 
population of interest. Had we been interested in the average causal effect in a 
particular subset of the population, we could have restricted our calculations 
to that subset. For example, if we had been interested in potential effect 
modification by sex, we would have estimated the standardized means in men 
and women separately. Both IP weighting and standardization can be used to 
estimate average causal effects in either the entire population or a subset of it. 

In summary, one should not choose between IP weighting and standard- 
ization when both methods can be used to answer a causal question. Just 
use both methods whenever possible. Even better, one can use doubly robust 
methods (see Technical Point 13.2) that combine models for treatment and for 
outcome in the same approach. 
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Technical Point 13.2 

Doubly robust methods. The previous chapter describes IP weighting, a method that requires a correct model for 
treatment A conditional on the confounders L. This chapter describes standardization, a method that requires a correct 
model for the outcome Y conditional on treatment A and the confounders L. How about a method that requires a 
correct model for either treatment A or outcome Y7 That is precisely what doubly robust estimation does. Under the 
usual identifiability assumptions, a doubly robust estimator consistently estimates the causal effect if at least one of the 
two models is correct (and one need not know which of the two models is correct). That is, doubly robust estimators 
give us two chances to get it right. 

There are many types of doubly robust estimators. For example, Bang and Robins (2005) proposed the following 
doubly-robust estimator for the average causal effect of a dichotomous treatment A on an outcome Y. First, estimate 
the IP weight W"^ = l/f{A\L) as described in the previous chapter. Then fit the outcome model described in this 
chapter but adding the covariate D, where D ~ \f A = 1 and D = —W^ \{ A = 0. That is, fit a model for 
E[y|A = a,C = 0,L = l,D]. Finally, use the predicted values from the model to obtain the standardized mean 
outcomes under A ^ 1 and A ~ 0. The difference of the mean standardized outcome is now doubly robust. That is, 
under exchangeability and positivity given L, this estimator consistently estimates the average causal effect if either the 
model for the treatment or for the outcome is correct. More about doubly robust methods in Chapter REF. 



13.5 How seriously do we take our estimates? 

We spent Part I of this book reviewing the definition of average causal ef- 
fect, tfic assumptions required to estimate it, and many potential biases. The 
discussion was purely conceptual, the data examples hypersimplistic. A key 
message was that the analysis of observational studies should emulate that of 
ideal randomized experiments as closely as possible. 

The analyses in this and the previous chapter are our first attempts at 
estimating causal effects from real data. Using both IP weighting and stan- 
dardization we estimated that the mean weight gain would have been 5.2 kg if 
everybody had quit smoking compared with 1.7 kg if nobody had quit smok- 
ing. Both methods estimated that quitting smoking increases weight by 3.5 kg 
(95% CI: 2.5, 4.5) on average in this particular population. In the next chap- 
ters we will see that similar estimates are obtained when using g-estimation, 
outcome regression, and propensity scores. The consistency across methods 
is reassuring because their estimates are based on different modeling assump- 
tions. However, our effect estimate is open to serious criticism. Even if we 
do not wish to transport our effect estimate to other populations (Chapter 4) 
and even if there is no interference between subjects, the validity of our esti- 
mates for the target population requires many conditions. We classify these 
conditions in three groups. 

First, the identifiability conditions of exchangeability, positivity, and well- 
defined interventions (Chapter 3) need to hold for the observational study to 
resemble a randomized experiment. The quitters and the non-quitters need to 
be exchangeable conditional on the 9 measured covariates L (see Fine Point 
14.2). Both unmeasured confounding (Chapter 7) and selection bias (Chapter 
8, Fine Point 12.2) may prevent conditional exchangeability. Positivity requires 
that the distribution of the covariates L in the quitters fully overlaps with that 
in the non-quitters (see Fine Point 12.1). Regarding well-defined interventions, 
note that there are multiple versions of both quitting smoking (e.g., quitting 
progressively, quitting abruptly) and not quitting smoking (e.g., increasing 
intensity of smoking by 2 cigarettes per day, reducing intensity but not to zero). 



At the very least, the consistency 

across methods makes it less likely 
that we had a serious programming 
error. 
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validity of our causal inferences 
uires the following conditions 

• exchangeability 

• positivity 

• well-defined interventions 

• no measurement error 

• no model misspecification 



Our effect estimate corresponds to a somewhat vague hypothetical intervention 

in the target population that randomly assigns these versions of treatment 
with the same frequency as they actually have in the study population. Other 
hypothetical interventions might result in a different effect estimate. 

Second, all variables used in the analysis need to be correctly measured. 
Measurement error in the treatment A, the outcome Y, or the confounders L 
will generally result in bias (Chapter 9). 

Third, all models used in the analysis need to be correctly specified (Chap- 
ter 11). Suppose that the correct functional form for the continuous covariate 
age in the treatment model is not the parabolic curve we used but rather a 
curve represented by a complex polynomial. Then, even if all the confoTinders 
had been correctly measured and included in L, IP weighting would not fully 
adjust for confounding. Model misspecification has a similar effect as measure- 
ment error in the confounders. 

Ensuring that each of these conditions hold, at least approximately, is the 
investigator's most important task. If these conditions could be guaranteed 
to hold, then the data analysis would be trivial. The problem is, of course, 
that one cannot ever expect that any of these conditions will hold perfectly. 
Unmeasured confounders, nonoverlapping confounder distributions, ill-defined 
interventions, mismeasured variables, and misspecified models will typically 
lurk behind our estimates. Some of these problems may be addressed em- 
pirically, but others will remain a matter of subject-matter judgement, and 
therefore open to criticism that cannot be refuted by our data. For example, 
we can propose different model specifications but we cannot adjust for variables 
that were not measured. 

Causal inferences rely on the above conditions, which are heroic and not 
empirically testable. We replace the lack of data on the distribution of the 
counterfactual outcomes by the assumption that the above conditions are ap- 
proximately met. The more our study deviates from those conditions, the 
more biased our effect estimate may be. Therefore a healthy skepticism of 
causal inferences drawn from observational data is necessary. In fact, a key 
step towards less casual causal inferences is the realization that the discTission 
should primarily revolve around each of the above assumptions. We only take 
our effects estimates as seriously as we take the conditions that are needed to 
endow them with a causal interpretation. 



Chapter 14 

G-ESTIMATION OF STRUCTURAL NESTED MODELS 



In the previous two chapters, we described IP weighting and standardization (the g-formula) to estimate the 
average causal effect of smoking cessation on body weight gain. In this chapter wc describe a third method to 
estimate the average causal effect: g-estimation. We use the same observational NHEFS data and provide simple 
computer code to conduct the analyses. 

IP weighting, the g-formula, and g-estimation are often collectively referred to as ^-methods because they 
are designed for application to generalized treatments, including time-varying treatments. Their application to 
the non-time-varying question discussed in Part II of this book may be then overkill since there are alternative 
approaches that many find simpler. However, by presenting these methods in a relatively simple setting, we can 
describe the methods while avoiding the more complex issues described in Part III. 

IP weighting and standardization were introduced in Part I (Chapter 3) and then described with models in 
Part II (Chapters 12 and 13. respectively). In contrast, wc have waited until Part II to describe g-estimation. 
There is a reason for that: describing g-estimation is facilitated by the specification of a structural model, even if 
the model is saturated. Models whose parameters are estimated via g-estimation are known as structural nested 
models. The three g-methods are based on different modeling assumptions. 



14.1 The causal question revisited 



As in previous chapters, we re- 
stricted the analysis to NHEFS indi- 
viduals with known sex, age, race, 
weight, height, education, alcohol 
use and intensity of smoking at 
the baseline (1971-75) and follow- 
up (1982) visits, and who answered 
the general medical history ques- 
tionnaire at baseline. 



Ill the last two chapters wc have applied IP weighting and standardization to 
estimate the average causal effect of smoking cessation (the treatment) A on 
weight gain (the outcome) Y. To do so, we used data from 1566 cigarette 
smokers aged 25-74 years who were classified as treated A = 1 ii they quit 
smoking, and as untreated A = 0 otherwise. We assumed that exchangeability 
of the treated and the untreated was achieved conditional on the L variables: 
sex, age, race, education, intensity and duration of smoking, physical activity 
in daily life, recreational exercise, and weight. We defined the average causal 
effect on the difference scale as E[y»=i''==0]-E[y''=0''==°], that is, the difference 
in mean weight that would have been observed if everybody had been treated 
and uncensored compared with untreated and uncensored. 

The quantity E[y*=^''==°] — E[y"=°''="°] measures the average causal ef- 
fect in the entire population. But sometimes one can be interested in the 
average causal effect in a subset of the population. For example, one may 
want to estimate the average causal effect in women — E[y"^^'°=°|w;oman] — 
gj'ya=o,c=0|y,Q^Q^j — ^ individuals aged 45, in those with low educational 
level, etc. To estimate the effect in a subset of the population one can use 
marginal structural models with product terms (see Chapter 12) or apply stan- 
dardization to that subset only (Chapter 13). 

Suppose that the investigator is interested in estimating the causal effect 
of smoking cessation A on weight gain Y in each of the strata defined by 
combinations of values of the variables L. In our example, there are many such 
strata. One of them is the stratum {non-quitter, male, white, age 26, college 
dropout, 15 cigarettes/day, 12 years of smoking habit, moderate exercise, very 
active, weight 112 kg}. As described in Chapter 4, investigators could partition 
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the study population into mutually exclusive subsets or non-overlapping strata, 

each of them defined by a particular combination of values I of the variables 
in L, and then estimate the average causal effect in each of the strata. In 
Section 12.5 we explain that an alternative approach is to add all variables 
L, together with product terms between each component of L and treatment 
A, to the marginal structural model. Then the stabilized weights SW^ (L) 
equal 1 and no IP weighting is necessary because the (unweighted) outcome 
regression model, if correctly specified, fully adjusts for all confounding by L 
(see Chapter 15). 

In this chapter we will use g-estimation to estimate the average causal effect 

of smoking cessation A on weight gain Y in each strata defined by the covari- 
ates L. This conditional effect is represented by E[Y"-^'==°\L] - E[F'^=o>'==°|L]. 
Before describing g-estimation, we will present structural nested models and 
rank preservation, and, in the next section, articulate the condition of ex- 
changeability given 1/ in a new way. 



14.2 Exchangeability revisited 



You may find the first paragraph 
of this section repetitious and un- 
necessary given our previous discus- 
sions of conditional exchangeability. 
If that is the case, we could not be 
happier. 



As a reminder (see Chapter 2), in our example, conditional exchangeability 
implies that, in any subset of the study population in which all individuals 
have the same values of L, those who did not quit smoking [A — 0) would 
have had the same mean weight gain as those who did quit smoking (A = 1) if 
they had not quit, and vice versa. In other words, conditional exchangeability 
means that the OTitcomc distribution would not differ bcrtwccn the treated 
and the untreated with the same covariate values, had they received the same 
treatment level. When the distribution of the outcomes Y" under treatment 
level a is the same for the treated and the untreated, each of the coTinterfactual 
outcomes Y°- is independent of the actual treatment level A, within levels of 
the covariates, or Y" 11 A\L for both a = 1 and a = 0. 

Take the counterfactual outcome under no treatment Y'^^'^. Under condi- 
tional exchangeability, knowing the value of Y°-^^ does not help differentiate 
between quitters and nonquitters when we also know the value of L. That 
is, the conditional (on L) probability of being a quitter is independent of the 
counterfactual outcome Y^^°. Mathematically, we write 

Pi[A = l|y"=°, L] = Pi[A = 1\L] 

which is an equivalent definition of conditional exchangeability for a binary 
treatment A. 

Expressing conditional exchangeability in terms of the conditional proba- 
bility of treatment will be helpful when we describe g-estimation later in this 
chapter. Specifically, suppose we propose the following parametric logistic 
model for the probability of treatment 



logitPr[A = l|r''=^i] = ao + aiY" 



For simplicity, we will not distin- 
guish between vector and scalar pa- 
rameters in this and subsequent 
chapters. This is an abuse of no- 
tation but we believe it does not 
create any confusion. 



where a2 is a vector of parameters, one for each component of L. If L has p 
components Li, ...Lp then = Yl^=i '^'^j^j- This model is the same one we 
used to estimate the denominator of the IP weights in Chapter 11, except that 
this model also includes the counterfactual outcome as a covariate. 

Of course, we can never fit this model to a real data set because we do 
not know the value of the variable y=o for all individuals. But suppose for 



G-estimation of structural nested models 



33 



a second that we had data on y*=o for all individuals, and that we fit the 

above logistic model. If there is conditional exchangeability and the model 
is correctly specified, what estimate would you expect for the parameter ai? 
Pause and think about it before going on (the response can be found near the 
end of this paragraph) because we will be estimating the parameter cti when 
implementing g-estimation. If you have already guessed what its value should 
be, you have already understood half of g-estimation. Yes, the expected value 
of the estimate of ai is zero because does not predict A conditional on 

L. We now introduce the other half of g-estimation: the structural model. 



14.3 Structural nested mean models 



Robins (1991) first described the 
class of structural nested models. 
These models are "nested" when 
the treatment is time-varying. See 
Part III for an explanation. 



We are interested in estimating the average causal effect of treatment A within 
levels of L, that is, E[F°=^|L] — E[y=°|L]. (For simphcity, suppose there is 
no censoring until later in this section.) Note that we can also represent this 
effect by E[y^^ — y^^'^ji] because the difference of the means is equal to the 
mean of the differences. If there were no effect-measure modification by L, 
these differences would be constant across strata, i.e., E[F°=^ — y"="|L] = /3i 
where 01 would be the average causal effect in each strata and also in the entire 
population. Our structural model for the conditional causal effect would be 

More generally, there may be effect modification by L. For example, the 
causal efi'ect of smoking cessation may be greater among heavy smokers than 
among light smokers. To allow for the causal effect to depend on L we can add a 
product term to the structural model, i.e., E[y — y^^'^ji] = fSia+f32aL, where 
/J2 is a vector of parameters. Under conditional exchangeability 1"° If j4|i, the 
conditional effect will be the same in the treated and in the untreated because 
the treated and the untreated are, on average, the same type of people within 
levels of L. Thus, under exchangeability, the structural model can also be 
written as 

E[Y"- -Y"-=°\A = a,L] =pia + l32aL 

which is referred to as a structural nested mean model. The parameters /3i and 
/32 (again, a vector) , which are estimated by g-estimation, quantify the average 
causal effect of smoking cessation A onY within levels of A and L. 

In Chapter 13 wc considered parametric models for the mean outcome Y 
that, like structural nested models, were also conditional on treatment A and 
covariates L. Those outcome models were the basis for standardization when 
estimating the parametric g-formula. In contrast with those parametric mod- 
els, structural nested models are semiparametric because they are agnostic 
about both the intercept and the main effect of L — that is, there is no parame- 
ter po and no term fS'sL. As a result of leaving these parameters unspecified, 
structural nested models make fewer assumptions and can be more robust to 
model misspecification than the parametric g-formula. See Fine Point 14.1 for 
a description of the relation between structural nested models and the marginal 
structural models of Chapter 12. 

In the presence of censoring, our causal effect of interest is not E[y"=^ — 
y=0|y4, L] but E[y=i''^=o - yo=o,c=0|^^ ^j. ^j^g average causal effect ff every- 
body had remained uncensored. Estimating this difference requires adjustment 
for both confounding and selection bias (due to censoring C = 1) for the effect 
of treatment A. As described in the previous two chapters, IP weighting and 
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Fine Point 14.1 

Relation between marginal structural models and structural nested models. Consider a marginal structural mean 
model for the average outcome under treatment level a within levels of the binary covariate V, a component of L, 

E[y«|y] = ^0 + Pia + ^2aV + psV 

The sum ^i+l32V is the average causal effect EfF'*^^ — F^^"]!/ = v] in the stratum V = v, and the sum Po+(33V is the 



mean counterfactual outcome under no treatment E[y=°|y = v] in the stratum V = v. Suppose the only inferential 

goal is the average causal effect /?i + /32W, i.e., we are not interested in estimating /3o + ^sv = EfF^^^lF = v]. Then 
we would write the model as E[y |y] = E[y=°|y] + fta + fi2a,V or, equivalently, as 

which is referred to as a semiparametric marginal structural mean model because, unlike the marginal structural models 
described in Chapter 12, leaves the mean counterfactual outcomes under no treatment E[y"°|F] completely unspeci- 
fied. 

To estimate the parameters of this semiparametric marginal structural model in the absence of censoring, we first 
create a pseudo-population with IP weights SW^{V) = f {A\V) / f {A\L) . In this pseudo-population there is only 
confounding by V and therefore the semiparametric marginal structural model is a structural nested model whose para- 
meters are estimated by g-estimation with V substituted by L and each individual's contribution weighted by SW"^{V). 
Therefore, in settings without time-varying treatments, structural nested models are identical to semiparametric mar- 
ginal structural models that leave the mean counterfactual outcomes under no treatment unspecified. Because marginal 
structural mean models include more parameters than structural nested mean models, the latter may be more robust to 
model misspecification. 

Consider the special case of a semiparametric marginal structural mean model within levels of all variables in 
L, rather than only a subset V so that SW^{V) are equal to 1 for all subjects. That is, let us consider the model 
E[ya_ya=0|£^j _ i]^a+/32aL, which we refer to as a faux semiparametric marginal structural model. Under conditional 
exchangeability, this model is the structural nested mean model we use in this chapter. 



Technically, IP weighting is not nec- 
essary for g-estimation with a non- 
time-varying treatment that does 
not affect any variable in L, and 
an outcome measured at a single 
time point. That is, if as we have 
been assuming 11 {A, C) \L, we 
can apply g-estimation to the un- 
censored subjects without having to 
IP weight. In contrast, IP weight- 
ing must be used whenever the un- 
censored and the censored are not 
exchangeable conditional on L. 



standardization can be used to adjust for these two biases. G-estimation, on 
the other hand, can only be used to adjust for confounding, not selection bias. 

Thus, when using g-estimation, one first needs to adjiist for selc;ction bias 
due to censoring by IP weighting. In practice, this means that we first estimate 
nonstabilized IP weights for censoring to create a pseudo-population in which 
nobody is censored, and then apply g-estimation to the pseudo-population. 
In our smoking cessation example, we will use the nonstabilized IP weights 
= 1/Pr [C = 0|i, A] that we estimated in Chapter 12. Again we assume 
that the vector of variables L is sufficient to adjust for both confounding and 
selection bias. 

All the g-estimation analyses described in this chapter incorporate IP weights 

to adjust for the potential selection bias due to censoring. Under the assump- 
tion that the censored and the uncensored are exchangeable conditional on the 
measured covariates L, the structural nested mean model E[Y" — = 

a, L] = j3ia + (320,L, when applied to the pseudo-population created by the IP 
weights W'-'' , is really a structural model in the absence of censoring: 



E[y'' 



,c=0 



ya- 



=0,c=0 



A = a,L] =l3ia + ^2aL 



For simplicity, we will omit the superscript c = 0 hereafter in this chapter. 
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Unlike IP weighting, g-estimation 
cannot be easily extended to es- 
timate the parameters of struc- 
tural logistic models for dichoto- 
mous outcomes. See Technical 
Point 14.1. 



In this chapter we will use g-estimation of a structural nested mean model 

to estimate the effect of the dichotomous treatment "smoking cessation" , but 
structural nested models can also be used for continuous treatment variables — 
like "change in smoking intensity" (see Chapter 12). For continuous variables, 
the model needs to specify the dose-response curve for the effect of treatment 
A on the mean outcome Y. For example, E[F'' — y^^jA = a,L] = /3ia -|- 
020^ + P^aL + Pao^L, or E[r" - F"=°|A = a, L] could be a smooth function, 
e.g., splines, of A and L. 

We now turn our attention to the concept of rank preservation, which will 
help us describe g-estimation of structural nested models. 



14.4 Rank preservation 
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Weight gain 



Figure 14.1 



In our smoking cessation example, all individuals can be ranked according to 
the value of their observed outcome Y. Subject 2352 is ranked first with weight 
gain of 48.5 kg, subject 6928 is ranked second with weight gain 47.5 kg... and 
subject 23321 is ranked last with weight gain of —41.3 kg. Similarly we could 
think of ranking all individuals according to the value of their counterfactual 
outcome under treatment y=i if the value of were known for all indi- 

viduals rather than only for those who were actually treated. But suppose for 
a second that we could rank everybody according to and also according 

to Y°-^^ . We would then have two lists of individuals ordered from larger to 
smaller value of the corresponding counterfactual outcome. If both lists are in 
identical order we say that there is rank preservation. 

When the effect of treatment A on the outcome Y is exactly the same, 
on the additive scale, for all individuals in the study population, we say that 
additive rank preservation holds. For example, if smoking cessation increases 
everybody's body weight by exactly 3 kg, then the ranking of individuals ac- 
cording to Y°-^^ would be equal to the ranking according to Y"^^^, except 
that in the latter list all individuals will be 3 kg heavier. A particular case of 
additive rank preservation occurs when the sharp null hypothesis is true (see 
Chapter 1), i.e., if treatment has no effect on the outcomes of any individual in 
the study population. For the purposes of structural nested mean models we 
will care about additive rank preservation within levels of L. This conditional 
additive rank preservation holds if the effect of treatment A on the outcome Y 
is exactly the same for all individuals with the same values of L. 

An example of an (additive conditional) rank-preserving structural model 

is 

Fj" - y"=° = ■^la -I- ii2aLi for all subjects i 

where + ip2l is the constant causal effect for all individuals with covariate 
values L = I. That is, for every individual i with L = I, the value of Yf'^^ 
is equal to 1^""° + + tp2l- A subject's counterfactual outcome under no 
treatment 5^"=" is shifted by ipi + to obtain the value of her counterfactual 
outcome imder treatment. 

Figure 14.1 shows an example of additive rank preservation within the 
stratum L = I. The bell-shaped curves represent the distribution of the coun- 
terfactual outcomes Y°-=^ (left curve) and (right curve). The two dots 
in the upper part of the figure represent the values of the two counterfactual 
outcomes for subject i, and the two dots in the lower part represent the val- 
ues of the two counterfactual outcomes for subject j. The arrows represent the 
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gain 



Figure 14.2 




Weight 

Figure 14.3 

A structural nested mean model is 
well defined in the absence of rank 
preservation. For example, one 
could propose a structural nested 
mean model for the setting depicted 
in Figure 14.3 to estimate the av- 
erage causal effect within strata of 
L. Such average causal effect will 
generally differ from the individual- 
level causal effects. 



shifts from y=o to Y"^^, which are equal to V'l +V'2^ for all individuals in this 
stratum. Figure 14.2 shows an example of rank preservation within another 
stratum L = I' . The distribution of the counterfactual outcomes is different 
from than in stratum L = I. For example, the mean of y"=o in Figure 14.1 is 
to the left of the mean of y"=o in Figure 14.2, which means that, on average, 
individuals in stratum L = I have a smaller weight gain under no smoking 
cessation than individuals in stratum L = I'. The shift from y«=o to is 
"01 + "02^' for all individuals with L = I' , as shown for individuals p and q. 

For most treatments and outcomes, the individual causal effect is not ex- 
pected to be constant — not even approximately constant — across individuals 
with the same covariate values, and thus (additive conditional) rank preserva- 
tion is scientifically implausible. In our example we do not expect that smoking 
cessation afTects equally the body weight of all individuals with the same val- 
ues of L. Some people are — genetically or otherwise — more susceptible to the 
efl'ects of smoking cessation than others, even within levels of the covariates 
L. The individual causal effect of smoking cessation will vary across people: 
after quitting smoking some individuals will gain a lot of weight, some will 
gain little, and others may even lose some weight. Reality may look more like 
the situation depicted in Figure 14.3, in which the shift from y«=o to y"=i 
varies across individuals with the same covariate values, and even ranks are 
not preserved since the outcome for individual i is less than that for individual 
j when a = 0 but not when a = 1. 

Because of the implausibility of rank preservation, one should not generally 
use methods for causal inference that rely on it. In fact none of the methods 
we consider in this book require rank preservation. For example, the marginal 
structural mean models from Chapter 12 are models for average causal effects, 
not for individual causal effects, and thus they do not assume rank preserva- 
tion. The estimated average causal effect of smoking cessation on weight gain 
was 3.5 kg (95% CI: 2.5, 4.5). This average effect is agnostic as to whether 
rank preservation of individual causal effects holds. Similarly, the structural 
nested mean model in the previous section made no assumptions about rank 
preservation. 

The additive rank-preserving model in this section makes a much stronger 
assumption than non-rank-preserving mean models: the assumption of con- 
stant treatment effect for all individuals with the same value of L. There is no 
reason why we would want to use such an unrealistic rank-preserving model 
in practice. And yet we use it in the next section to introduce g-estimation 
because g-estimation is easier to understand for rank-preserving models, and 
because the g-estimation procedure is actually the same for rank-preserving 
and non-rank-preserving models. Note that the (conditional additive) rank- 
preserving structural model is a structural mean model — the mean of the indi- 
vidual shifts from to is equal to each of the individual shifts within 
levels of L. 



14.5 G-estimation 

This section links the material in the previous three sections. Suppose the 
goal is estimating the parameters of the structural nested mean model E[y° — 
yo=0|^ = a, X] = /3ia. For simplicity, we first consider a model with a single 
parameter Because the model lacks product terms l32aL, we are effectively 
assuming that the average causal effect of smoking cessation is constant across 
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strata of L, i.e., no effect modificatioii by L. 

We also assume that the additive rank-preserving model — Y^^^'^ = ipia 
is correctly specified for all individuals i. Then the individual causal effect tpx 
is equal to the average causal effect /3i in which we are interested. We write 
the rank-preserving model as Y"" — y"=o = ipia, without a subscript i to index 
individuals because the model is the same for all individuals. For reasons that 
will soon be obvious, we write the model in the equivalent form 

yo=0 = ya _ 

The first step in g-estimation is linking the model to the observed data. To 
do so, remember that an individual's observed outcome Y is, by consistency, 
the counterfactual outcome Y°'~^ if the person received treatment A = 1 or 
the counterfactual outcome y=o if the person received no treatment A = Q. 
Therefore, if we replace the fixed value a in the structural model by each 
individual's value A — which will be 1 for some and 0 for others — then we can 
replace the counterfactual outcome Y"' by the individual's observed outcome 
Y^ = Y. The rank-preserving structural model then implies an equation 
in which each individuals's counterfactual outcome y =o is a function of his 
observed data on treatment and outcome and the unknown parameter ^i: 

ya=0 - iPiA 

If this model were correct and we knew the value of V-'i then we could calcu- 
late the counterfactual outcome under no treatment y^=o j^j. gg^^j^ individual 
in the study population. But we don't know ipi . Estimating it is precisely the 
goal of our analysis. 

Let us play a game. Suppose a friend of yours knows the value of ipi but he 
only tells you that ip\ is one of the following: = —20, = 0, or tp^ = 10. 
He challenges you: "Can you identify the true value tjji among the 3 possible 
values ip^T^ You accept the challenge. For each individual, you compute 

for each of the three possible values . The newly created variables _H"(— 20), 
i?(0), and H{10) are candidate counterfactuals. Only one of them is the coun- 
terfactual outcome F«=o. More specifically, H{ijj'^) = if = ipi. In 
this game, choosing the correct value of ipi is equivalent to choosing which 
one of the three candidate counterfactuals H{ip^) is the true counterfactual 
yo=o _ Can you think of a way to choose the right H{ip'^)l 

Remember from Section 14.2 that the assumption of conditional exchange- 
ability can be expressed as a logistic model for treatment given the counterfac- 
tual outcome and the covariates L. When conditional exchangeability holds, 
Rosenbaum (1987) proposed a ver- the parameter ai for the counterfactual outcome should be zero. So we have 
sion of this procedure for non-time- a simple method to choose the true counterfactual out of the three variables 
varying treatments. H{ip^). We fit three separate logistic models 

\ogitPr[A^l\H{tlj^),L] = ao + aiH{i)^) + a2L 

one per each of the three candidates H{ip^). The candidate H{il>^) with ai =0 

is the counterfactual Y"^^^, and the corresponding tp^ is the true value ijji. For 
example, suppose that H{tlj^ = 10) is unassociated with treatment A given 
the covariates L. Then our estimate ijji of ipi is 10. We are done. That was 
g-estimation. 



Important: G-estimation does not 
test whether conditional exchange- 
ability holds; it assumes that condi- 
tional exchangeability holds. 
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Multiplicative structural nested mean models. In the text we only consider additive structural nested mean models. 
When the outcome variable Y can only take positive values, a multiplicative structural nested mean model is preferred. 
An example of an multiplicative structural nested mean model is 



log 



E[Y''\A = a,L] 
E[y«=0|A = a,L] 



= /3ia + /32L 



-v4« ■ 



4l 



which can be fit by g-estimation with H{^j'^) defined to be Fcxp 

The above multiplicative model can also be used for binary (0, 1) outcome variables as long as the probability of 
y = 1 is small in all strata of L. Otherwise, the model might predict probabilities greater than 1. If the probability is 
not small, one can consider a structural nested logistic model for a dichotomous outcome Y such as 



logit Pr[y« = l\A = a,L]- logit Pr[y 



a=0 



l\A = a,L]=l3ia + l32L 



Unfortunately, structural nested logistic models do not generalize easily to time-varying treatments and their parameters 
cannot be estimated using the g-estimation algorithm described in the text. For details, see Tchetgen Tchetgen and 
Rotnitzky (2011). 



code: Program 14.2 



We calculated the P-value from 
a Wald test. Any other valid 
test may be used. For exam- 
ple, we could have used a Score 
test, which simplifies the calcula- 
tions (it doesn't require fitting mul- 
tiple models) and, in large samples, 
is equivalent to a Wald test. 



In practice, however, we need to g-estimate the parameter in the absence 
of a friend who knows the right answer and likes to play games. Therefore we 
will need to search over all possible values tp'^ until we find the one that results 
in an H{'ip^) with = 0. Because not all possible values can be tested — ^there 
is an infinite number of values ^p^ in any given interval — we can conduct a fine 
search over many pre-specified values (e.g., from —20 to 20 by increments 
of 0.01). The finer the search, the closer to the true estimate tpi we will get, 
but also the greater the computational demands. 

In our smoking cessation example, we first computed each individual's value 
of the 31 candidates i?(2.0), H{2A), H{2.2), ...iJ(4.9), and H{5.0) for values 
tp^ between 2.0 and 5.0 by increments of 0.1. We then fit 31 separate logistic 
models for the probability of smoking cessation. These models were exactly 
like the one used to estimate the denominator of the IP weights in Chapter 
12, except that we added to each model one of the 31 candidates H{ip^). 
The parameter estimate cti for H{ip^) was closest to zero for values H{3A) 
and H{3.5). A finer search found that the minimum value of ai (which was 
essentially zero) was for H{3A4Q). Thus, our g-estimate of the average 
causal effect ipi = (3i of smoking cessation on weight gain is 3.4 kg. 

To compute a 95% confidence interval around our g-estimate of 3.4, we used 
the P-value for a test of ai = 0 in the logistic models fit above. As expected, 
the P-value was 1 it was actually 0.998 — for = 3.446, which is the value 
tp^ that results in a candidate H{iIj^) with a parameter estimate di = 0. Of 
the 31 logistic models that we fit for values between 2.0 and 5.0, the P-value 
was greater than 0.05 in all models with H{ijA) based on tfj^ values between 
approximately 2.5 and 4.5. That is, the test did not reject the null hypothesis 
at the 5% level for the subset of ip^ values between 2.5 and 4.5. By inverting 
the test results, we concluded that the limits of the 95% confidence interval 
around 3.4 are 2.5 and 4.5. 

More generally, the 95% confidence interval for a g-estimate is determined 
by finding the set of values of ip^ that result in a P-value> 0.05 when testing for 
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Fine Point 14.2 

Sensitivity analysis for unmeasured confounding. G-estimation relies on the fact that ai = 0 // conditional 
exchangeability given L holds. Now consider a setting in which conditional exchangeability does not hold. For example, 
suppose that the probability of quitting smoking A is lower for individuals whose spouse is a smoker, and that the 
spouse's smoking status is associated with important determinants of weight gain Y not included in L. That is, 
there is unmeasured confounding by spouse's smoking status. Because now the variables in L are insufficient to achieve 
exchangeability of the treated and the untreated, the treatment A and the counterfactual y=o are associated conditional 
on L. That is, o-i 7^ 0 and we cannot apply g-estimation as described in the main text. 

But g-estimation does not require that ai = 0. Suppose that, because of unmeasured confounding by the spouse's 
smoking status, ai is expected to be 0.1 rather than 0. Then we can apply g-estimation as described in the text 
except that we will test whether ai — 0.1 rather than whether ai = 0. G-estimation does not require that conditional 
exchangeability given L holds, but that the magnitude of nonexchangeability — the value of ai — is known. This property 
of g-estimation can be used to conduct sensitivity analyses for unmeasured confounding. 

If we believe that L may not sufficiently adjust for confounding, then we can repeat our g-estimation analysis under 
different scenarios of unmeasured confounding, represented by a range of vales of ai, and plot the effect estimates under 
each of them. Such plot shows how sensitive our effect estimate is to unmeasured confounding of different direction 
and magnitude. One practical problem for this approach is how to quantify the unmeasured confounding on the ai 
scale, e.g., is 0.1 a lot of unmeasured confounding? Robins, Rotnitzky, and Scharfstein (1999) provide technical details 
on sensitivity analysis for unmeasured confounding using g-estimation. 



In the presence of censoring, the fit 
of the logistic models is necessar- 
ily restricted to uncensored individ- 
uals (C = 0), and the contribution 
of each individual is weighted by 
the estimate of his/her IP weight 
SW^. See Technical Point 14.2. 



«! = 0. The 95% confidence interval is obtained by inversion of the statistical 
test for ai = 0, with the limits of the 95% confidence interval being the limits 
of the set of values with P-value> 0.05. In our example, the statistical test 
was based on a robust variance estimator because of the use of IP weighting to 
adjust for censoring. Therefore our 95% confidence interval is conservative in 
large samples, i.e., it will trap the true value at least 95% of the time. In large 
samples, bootstrapping would result in a non-conservative, and thus possibly 
narrower, 95% confidence interval for the g-estimate. 

Back to non-rank- preserving models. The g-estimation algorithm (i.e., the 
computer code implementing the procedure) for tpi produces a consistent es- 
timate of the parameter /3i of the mean model, assuming the mean model is 
correctly specified (that is, if the average treatment effect is equal in all levels 
of L). This is true regardless of whether the individual treatment efi^ect is 
constant, that is, regardless of whether the conditional additive rank preserva- 
tion holds. In other words, the validity of the g-estimation algorithm does not 
actually require that H{pi) = for all subjects, where /?i is the parameter 

value in the mean model. Rather, the algorithm only requires that -ff(/3i) and 
have the same conditional mean given L. 



14.6 Structural nested models with two or more parameters 

We have so far considered a structural nested mean model with a single pa- 
rameter (3i. The lack of product terms (32aL imply that we believe that the 
average causal effect of smoking cessation does not vary across strata of L. The 
structural nested model will be misspecified — and thus our causal inferences 
will be wrong — if there is indeed effect modification by some components V of 
L but we failed to add a product term l32aV. This is in contrast with marginal 
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As discussed in Chapter 12, a de- 
sirable property of marginal struc- 
tural models is null preservation: 
when the null hypothesis of no aver- 
age causal effect is true, the model 
is never misspecified. Structural 
nested models preserve the null too. 
In contrast, although the g-formula 
preserves the null for non-time- 
varying treatments, it loses this 
property in the time-varying setting 
(see Part III). 



The Nelder-Mead Simplex method 
is an example of a directed search 
method. 
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You may argue that structural 
nested models with multiple para- 
meters may not be necessary. If 
all variables L are discrete and the 
study population is large, one could 
fit separate 1-parameter models to 
each subset of the population de- 
fined by a combination of values of 
L. True for fixed treatments A, but 
not true for the time-varying treat- 
ments we will discuss in Part III. 



structural models, which are not misspecified if we fail to add terms l32aV and 
(3:iV even if there is effect modification by V. Marginal structural models that 
do not condition on V estimate the average causal effect in the population, 
whereas those that condition on V estimate the average causal effect within 
levels oiV. Structural nested models estimate, by definition, the average causal 
effect within levels of the confounders L, not the average causal effect in the 
population. Omitting product terms in structural nested models when there 
is effect modification will generally lead to bias due to model misspecification. 

Fortunately, the g-estimation procedure described in the previous section 
can be generalized to models with product terms. For example, suppose we be- 
lieve that the average causal effect of smoking cessation depends on the baseline 
level of smoking intensity V. We may then consider the structural nested mean 
model E[F" — = a,L] = j3\a + p2aV and, for g-estimation purposes, 

the corresponding rank-preserving model Yf" — y^°=o = ijjia + 1JJ2O.V. Because 
the structural model has two parameters, ipi and tp2, we also need to include 
two parameters in the IP weighted logistic model for Pt[A = l\H(ip^),L]. For 
example, we could fit the logistic model 

logitPv[A=l\H{ip^),L]=ao+aiH{ip^) + a2H{ip^)V + a3L 

and find the combination of values of ipl and tpl that result in a H{ip^) that is 
independent of treatment A conditional on the covariates L. That is, we need 
to search the combination of values tpl and tpl that make both ai and a2 equal 
to zero. 

Because the model has two parameters, the search must be conducted over 
a two-dimensional space. Thus a systematic, brute force search will be more 
involved than that described in the previous section. Less computationally in- 
tensive approaches, known as directed search methods, for approximate search- 
ing are available in statistical software. For linear mean models like the one 
discussed here — but not, for example, for certain survival analysis models — 
the estimate can be directly calculated using a formula, i.e. . the estimator has 
dosed form and a search over the possible values of the parameters is not 
necessary (see Technical Point 14.2 for details). In our smoking cessation ex- 
ample, the g-estimates were 'ipi = 2.86 and 1/^2 = 0.03. The corresponding 95% 
confidence intervals can be calculated by, for example, bootstrapping. 

In the more general case, we would consider a model that allows the average 
causal effect of smoking cessation to vary across all strata of the variables 
in L. For dichotomous variables, the corresponding rank-preserving model 

— y^""" = V'lO + aX!j=i ''l^'ijLj has p + 1 parameters ipi,ip2i,...ip2p, where 
(,i'2j is the parameter corresponding to the product term aLj and Lj represents 
one of the p components of L. The average causal effect in the entire study 
population can then be calculated as ipi + j/^^^ii^^jLj, where N is the 
number of study subjects. In practice, structural nested models with multiple 
parameters have rarely been used. 

In fact, structural nested models of any type have rarely been used, partly 
because of the lack of user-fricridly software and partly becaTisc the extension 
of these models to survival analysis require some additional considerations 
(see Chapter REF). We now review two methods that are arguably the most 
connnonly used approaches to adjust for confounding: outcome regression and 
propensity scores. 
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Technical Point 14.2 

G-estimation of structural nested mean models. Consider the structural nested model E[y — F°"'^|j4 = a,L] = 
/3ia. A consistent estimate of /3i can be obtained by g-estimation under the assumptions described in the text. 
Specifically, our estimate of I3i is the value of H{tp^) that minimizes the association between H{'ip'^) and A. When we 
base our g-estimate on the score test (see, for example, Casella and Berger 2002), this procedure is equivalent to finding 
the parameter value ■^^ that solves the estimating equation 

N 

^ I [a = 0] M/f _ E [A\U]) = 0 

j=i 

where the indicator I [d = 0] takes value 1 for subject i if Cj = 0 and takes value 0 otherwise, and the IP weight Wf 
and the expectation E = Pr = 3re replaced by their estimates. E can be estimated from a logistic 

model for treatment conditional on the covariates L in which subject i contribution is weighted by Wf if Cj = 0 and 
it is zero otherwise. [Because A and L are observed on all subjects, we could also estimate E [A\Li\ by an unweighted 
logistic regression of A on L using all subjects.] 

The solution to the equation has a closed form and can therefore be calculated directly, i.e., no search over the 
parameter space is required. Specifically, using the fact that Hi(ip^) =Yi — ip^Ai we obtain that ipi equals 

N N 

J2nCi = 0] WfYi {Ai - E [A\L,]) I [Ci = 0] WfA^ {A^ - E [A\Li]) 

i=l i=l 

If ijj is D-dimensional, we multiply the left-hand side of the estimating equation by a D-dimensional vector function of L. 
The choice of the function affects the statistical efficiency of the estimator, but not its consistency. That is, although 
all choices of the function will result in valid confidence intervals, the length of the confidence interval will depend on 
the function. Robins (1994) provided a formal description of structural nested mean models, and derived the function 
that minimizes confidence interval length. 

A natural question is whether we can further increase efficiency by replacing Hi{'ip'^) by a nonlinear function, such 
as [iJj(V'^)] , in the above estimating equation and still preserve consistency of the estimate. The answer is no if 
we assumed a non-rank-preserving model because, under a non-rank-preserving model, Y°'~^ 11 A\L does not imply 
H{j3i)\l A\L, but only mean independence conditional on L, i.e., 'Ei[H{(3i)\A,L] = ¥j[H{Pi)\L]. The answer is 
yes if we assumed a (conditional linear) rank- preserving model because under a rank-preserving model Y"-^^ n A\L 
implies H{pi) U A\L. It is this latter fact, and not rank preservation per se, that allows nonlinear functions of Hi{jjj^) 
to be used in our estimating equation. 

The estimator of ^ is consistent only if the models used to estimate E [A\L\ and Pr [C = V\A, L\ are both correct. 
We can construct a more robust estimator by replacing H{%1>'^) by iJ(i/;^) — E [i?(V'^)|i] in the estimating equation, and 
then estimating the latter conditional expectation by fitting an unweighted linear model for E [i?(^''')|L] = E [y^'^ji] 
among the uncensored subjects. If this model is correct then the estimate of ^ solving the modified estimating equation 
remains consistent even if both the above models for E[A|L] and Pr[C= L] are incorrect. Thus we obtain a 
consistent estimator of ^ if either (i) the model for E or (ii) both models for E [A\L] and Pr [C = 1\A, L] are 

correct, without knowing which of (i) or (ii) is correct. We refer to such an estimator as being doubly robust. Robins 
(2000) provided a closed-form doubly robust estimator for the linear structural nested mean model. 
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Chapter 15 

OUTCOME REGRESSION AND PROPENSITY SCORES 



Outcome regression and various versions of propensity score analyses are the most commonly used parametric 
methods for causal inference. You may rightly wonder why it took us so long to include a chapter that discusses 
these methods. So far we have described IP weighting, the g-formula, and g-estimation — the g-methods. Presenting 
the most commonly used methods after the least commonly used ones seems an odd choice on our part. Why 
didn't we start with the simpler and widely used methods based on outcome regression and propensity scores? 
Because these methods do not work in general. 

More precisely, the simpler outcome regression and propensity score methods — as described in a zillion pub- 
lications that this chapter cannot possibly summarize — work fine in simpler settings, but these methods are not 
designed to handle the complexities associated with causal inference for time-varying treatments. In Part III we 
will again discuss IP weighting, the g-formula, and g-estimation but will say less about conventional outcome 
regression and propensity score methods. This chapter is devoted to causal methods that are commonly used but 
have limited applicability for complex longitudinal data. 



15.1 Outcome regression 



Reminder: We defined the aver- 
age causal effect as E[y=^''^=°] - 
gjya=o,c=0] assumed that 

exchangeability of the treated and 
the untreated was achieved condi- 
tional on the L variables sex, age, 
race, education, intensity and dura- 
tion of smoking, physical activity in 
daily life, recreational exercise, and 
weight. 



In a slightly humorous vein, we refer 
to this structural model as a faux 
marginal structural model: it has 
the form of a marginal structural 
model but IP weighting is not re- 
quired. The stabilized IP weights 
SW^{L) are all equal to 1 because 
the model is conditional on the en- 
tire vector L rather than on a sub- 
set V of L. 



In the last three chapters we have described IP weighting, standardization, 
and g-estimation to estimate the average causal effect of smoking cessation 
(the treatment) A on weight gain (the outcome) Y. We also described how to 
estimate the average causal efi:ect within subsets of the population, either by 
restricting the analysis to the subset of interest or by adding product terms in 
marginal structural models (Chapter 12) and structural nested models (Chap- 
ter 14). Take structural nested models. These models include parameters for 
the product terms between treatment A and the variables L, but no parame- 
ters for the variables L themselves. This is an attractive property of structural 
nested models because we are interested in the causal effect of A on y within 
levels of L but not in the (noncausal) relation between L and Y. A method — 
g-estimation of structural nested models — that is agnostic about the functional 
form of the L-Y relation is protected from bias due to misspecifying this rela- 
tion. 

On the other hand, if we were willing to specify the L-Y association within 
levels of A, we would consider the structural model 

E[Y"'^=°\L] = /3o + ^la + p^aL + p^L 

where /32 and /^s are vector parameters. The average causal effects of smoking 
cessation A on weight gain Y in each stratum of L are a function of (3i and /32, 
the mean counterfactual outcomes under no treatment in each stratum of L 
are a function of /3o and P^. The parameter ^3 is usually referred as the main 
effect of i, but the use of the word effect is misleading because may not 
have an interpretation as the causal effect of L (there may be confounding for 
L). The parameter /^s simply quantifies how the mean of the counterfactual 
yo=o,c=o varies as a function of L, as we can see in our structural model. See 
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Fine Point 15.1 

Nuisance parameters. Suppose our goal is to estimate the causal parameters /3iand /32- If we do so by fitting the 
outcome regression model E[y'^^'^^'^|L] = f3Q+ l3ia + /32aL + fS^L, our estimates of /3iand (32 will in general be consistent 
only if j3o + fS^L correctly models the dependence of the mean E[F"=*^'°='^|L] on L. We refer to the parameters /3o and 
/33 as nuisance parameters because they are not our parameters of primary interest. 

On the other hand, if we estimate ftand /32 by g-estimation of the structural nested model E[y°''==°— = 
(5ia + (52aL, then our estimates of /9iand fi2 will in general be consistent only if the conditional probability of treatment 
given L Pr[A = 1\L\ is correct. That is, the parameters of the treatment model such as logit Pr[A =1\L]= aQ + aiL 
are now the nuisance parameters. 

For example, bias would arise in the outcome regression model if a covariate L is model with a linear term /H^L 
when it should actually be linear and quadratic (3-iL + P^L'^ . Structural nested models are not subject to misspecification 
of an outcome regression model because the L-Y relation is not specified in the structural model. However, bias would 
arise when using g-estimation of structural nested models if the L-A relation is misspecified in the treatment model. 
Symmetrically, outcome regression models are not subject to misspecification of a treatment model. For fixed treatments 
that do not vary over time, deciding what method to use boils down to deciding which nuisance parameters — those in 
the outcome model or in the treatment model — ^we believe can be more accurately estimated. A better alternative is to 
use doubly-robust methods (see Technical Point 14.2). 



When outcome regression is an in- 
termediate step to estimate the 
mean of the counterfactual out- 
comes, correct specification of the 
dependence of ya=o.c=o ^ jg rg. 
quired. Therefore, the parameters 
/3o and fS^ become necessary too. 
(3o and /Js would also become nec- 
essary if we were interested in us- 
ing our model estimates to estimate 
the conditional (within levels of L) 
causal effect on the multiplicative 
rather than additive scale. 
CODE: Program 15.1 



Fine Point 15.1 for a discussion of parameters that, like /3o and Ps, do not have 
a causal interpretation. 

The counterfactual mean outcomes if everybody in stratum I of L had been 
treated and remained uncensored, E[y=^''==°|i = I], are equal to the corre- 
sponding mean outcomes in the imccnsorcd treated. E[y|A = 1, C = 0. Z = I], 
under exchangeability, positivity, and well-defined interventions. And analo- 
gously for the untreated. Therefore the parameters of the above structural 
model can be estimated via ordinary least squares by fitting the outcome re- 
gression model 

E[Y\A, C ^ 0,L] = ao + aiA + a2AL + a^L 

as described in Section 13.2. Like stratification in Chapter 3, outcome regres- 
sion adjusts for confounding by estimating the causal effect of treatment in 
each stratum of L. If the variables L arc sufficient to adjust for confounding 
(and selection bias) and the outcome model is correctly specified, no further 
adjustment is needed. 

In Section 13.2, outcome regression was an intermediate step towards the 
estimation of a standardized outcome mean. Here, outcome regression is the 
end of the procedure. Rather than standardizing the estimates of the condi- 
tional means to estimate a marginal mean, we just compare the conditional 
mean estimates. In Section 13.2, we fit a regression model with only one prod- 
uct term in (52 (between A and smoking intensity). That is, a model in which 
we a priori set most product terms equal to zero. Using the same model as in 
Section 13.2, here we obtained the parameter estimates /3i = 2.6 and (32 = 0.05. 
As an example, the eSect estimate E[F| A = 1, C = 0, L] -E[y | A = 0, C = 0, L] 
was 2.8 (95% CI: 1.5, 4.1) for those smoking 5 cigarettes/day, and 4.4 (95% 
CI: 2.8, 6.1) for 40 cigarettes/day. A common approach to outcome regression 
is to assume that there is no effect modification by any variable in L. Then 
the model is fit without any product terms and (3i is an estimate of both the 
conditional and marginal average causal effects of treatment. In our example, 
a model without any product terms yielded the estimate 3.5 (95% CI: 2.6, 4.3) 
kg. 
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In this chapter we did not need to explain how to fit an outcome regression 
model because we had already done it in Chapter 13 when estimating the 
components of the g-formula. It is equally straightforward to use outcome 
regression for discrete outcomes, e.g., for a dichotomous outcome Y one could 
fit a logistic model for Pr [F = 1| A = a, C = 0, i]. 



15.2 Propensity scores 



code: Program 15.2 
Here we only consider propensity 
scores for dichotomous treatments. 
Propensity score methods, other 
than IP weighting and g-estimation 
and other related doubly-robust es- 
timators, are difficult to generalize 
to non-dichotomous treatments. 
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Figure 15.1 



In the study population, due to 
sampling variability, the propen- 
sity score only approximately "bal- 
ances" the covariates L. 



When using IP weighting (Chapter 12) and g-estimation (Chapter 14), we 
estimated the probability of treatment given the covariates L, Pi [A — 1\L], 
for each individual. Let us refer to this conditional probability as p{L). The 
value of p{L) is close to 0 for individuals who have a low probability of receiving 
treatment and is close to 1 for those who have a high probability of receiving 
treatment. That is, p{L) measures the propensity of individuals to receive 
treatment given the information available; in the covariates L. No wonder that 
p{L) is referred to as the propensity score. 

In an ideal randomized trial in which half of the individuals are assigned 
to treatment A = 1, the propensity score p{L) = 0.5 for all individuals. Also 
note that p{L) = 0.5 for any choice of L. In contrast, in observational studies 
some individuals may be more likely to receive treatment than others. Be- 
cause treatment assignment is beyond the control of the investigators, the true 
propensity score p{L) is unknown, and therefore needs to be estimated from 
the data. 

In OUT example, we can estimate the propensity score p{L) by fitting a 
logistic model for the probability of quitting smoking A conditional on the 
covariates L. This is the same model that we used for IP weighting and g- 
estimation. Under this model, individual 22941 was estimated to have the 
lowest estimated propensity score (0.053), and individual 24949 the highest 
(0.793). Figure 15.1 shows the distribution of the estimated propensity score 
in quitters A = 1 (top) and nonquittcrs A = 0 (bottom). As expected, those 
who quit smoking had, on average, a greater estimated probability of quitting 
(0.312) than those who did not quit (0.245). If the distribution of p{L) were 
the same for the treated A = 1 and the untreated A = 0, then there would be 
no confounding due to L, i.e., there would be no open path from L to A on a 
causal diagram. 

Individuals with same propensity score p{L) will generally have different 
values of some covariates L. For example, two individuals with p{L) = 0.2 
may differ with respect to smoking intensity and exercise, and yet they may 
be equally likely to quit smoking given all the variables in L. That is, both 
individuals have the same conditional probability of ending up in the treated 
group A = 1. If we consider all individuals with a given value of p{L) in the 
superpopulation, this group will include individuals with different values of L 
(e.g., different values of smoking intensity and exercise), but the distribution 
of L will be the same in the treated and the untreated, that is, AUL\p{L). We 
say the propensity score balances the covariates between the treated and the 
untreated. Of course, the propensity score only balances the measured covari- 
ates L, which does not prevent residual confounding by unmeasured factors. 
Randomization balances both the mcasTired and the immcasured covariates, 
and thus it is the preferred method to eliminate confounding. See Technical 
Point 15.1 for a formal definition of a balancing score. 

Like all methods for causal inference that we have discussed, the use of 
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Technical Point 15.1 

Balancing scores and prognostic scores. As discussed in tlie text, tlie propensity score p{L) balances the covariates 
between the treated an the untreated. In fact, the propensity score p{L) is the simplest example of a balancing score. 
More generally, a balancing score b{L) is any function of the covariates L such that All L\b{L). That is, for each 
value of the balancing score, the distribution of the covariates L is the same in the treated and the untreated. Another 
example of a balancing score is the quantity A — E [A\L] that is used in g-estimation for a non-dichotomous treatment. 
Rosenbaum and Rubin (1983) proved that exchangeability and positivity based on the variables L implies exchangeability 
and positivity based on a balancing score b{L). If it is sufficient to adjust for L, then it is sufficient to adjust for a 
balancing score b{L), including the propensity score p{L). Figure 15.2 depicts the propensity score for the setting 
represented in Figure 7.1: the p{L) is an intermediate between L and A with a deterministic arrow from L to p{L). 

An alternative to a balancing score b{L) is a prognostic score s{L), i.e., a function of the covariates L such that 
Y"'^^ U L\s{L). Note that in a marginally randomized trial both the defining conditional independencies of balancing 
scores and prognostic scores {AJI L and 11 i, respectively) are expected unconditionally. Adjustment methods 
can be developed for both balancing scores and prognostic scores, but methods for prognostic scores require stronger 
assumptions and cannot be readily extended to time-varying treatments. See Hansen (2008) for a discussion of prognostic 
scores. 



propensity score methods requires the identifying conditions of exchangeabil- 
ity and positivity (besides, of course, well-defined interventions). The use of 
propensity score methods is justifed by the following key result: Exchangeabil- 
ity of the treated and the untreated within levels of the covariates L implies 
If L is sufficient to adjust for con- exchangeability within levels of the propensity score p{L). That is, conditional 
founding and selection bias, then exchangeability ]J A|i implies ]J A|p(L). Further, positivity within lev- 
p{L) is sufficient too. This result els of the propensity score p(L) — which means that no individual has a propen- 
was derived by Rosenbaum and Ru- sity score equal to either 1 or 0 — holds if and only if positivity within levels 
bin in a seminal paper published in of the covariates L, as defined in Chapter 2, holds. Under exchangeability 
1983. y° Yi ^\p{L) and positivity within levels oip{L), the propensity score can also 

be used to estimate causal effects using stratification (including outcome re- 
gression) , standardization, and matching. We now describe how to implement 
each of these methods. 



15.3 Propensity stratification and standardization 



p(L) >A >Y 

Figure 15.2 



Under exchangeability and positivity, propensity score stratification can be 
used to consistently estimate the average causal effect among individuals with 
a particular value s of the propensity score p{L), i.e., E[y*=^'°=°|p(Z/) = s] — 
gjya=o,c=0|p^^^ = s]. In its simplest form, the analysis would be restricted to 
individuals with the value s. However, the propensity score p{L) is a continuous 
variable that can take any value between 0 and 1. It is therefore unlikely that 
two individuals will have exactly the same value s. For example, only individual 
1089 had an estimated p(_L) of 0.6563, which means that we cannot estimate the 
causal effect among individuals with p{L) = 0.6563 by comparing the treated 
and the untreated with that particular value. 

In practice, propensity score stratification is carried out within strata that 
contain individuals with similar, but not identical, values of p{L). The deciles 
of the estimated p{L) is a popular choice: individuals in the population are 
classified in 10 strata of approximately equal size, then the causal effect is 
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code: Program 15.3 



Caution: the denominator of the 
IP weights for a dichotomous treat- 
ment A is not the propensity score 
p{L), but a function of p{L). The 
denominator \s p{L) for the treated 
{A = 1) and 1 - p{L) for the un- 
treated {A = 0). 



code: Program 15.4 



estimated in each of the strata. In our example, each decile contained approxi- 
mately 162 individuals. The effect of smoking cessation on weight gain ranged 
across deciles from 0.0 to 6.6 kg, but the 95% confidence intervals around these 
point estimates were wide. 

We could have also obtained these effect estimates by fitting an outcome 
regression model for E[y|A, C = 0,p{L)] that included as covariates treatment 
A, 9 indicators for the deciles of the estimated p{L) (one of the deciles is the 
reference level and is already incorporated in the intercept of the model), and 
9 product terms between A and the indicators. Most applications of outcome 
regression with deciles of the estimated p{L) do not include the product terms, 
i.e., they assume no effect modification by p{L). In our example, a model 
without product terms yields an effect estimate of 3.5 kg (95% CI: 2.6, 4.4). 

Stratification on deciles or other functions of the propensity score raises a 
potential problem: in general the distribution of the continuous p{L) will differ 
between the treated and the untreated within some strata (e.g., deciles). If, for 
example, the average p{L) were greater in the treated than in the untreated 
in some strata, then the treated and the untreated might not be exchangeable 
in those strata. This problem did not arise in previous chapters, when we 
used functions of the propensity score to estimate the parameters of structural 
models via IP weighting and g-estimation, because those methods used the 
numerical value of the estimated probability rather than a categorical transfor- 
mation like deciles. Similarly, the problem does not arise when using outcome 
regression for E[y|A, C = 0,p{L)] with the estimated propensity score p(L) 
as a continuous covariate rather than as a set of indicators. When we used 
this latter approach in our example the effect estimate was 3.6 (95% CI: 2.7, 
4.5) kg. The validity of our inference depends on the correct specification of 
the relationship between p{L) and the mean outcome Y (which we assumed to 
be linear). Note that IP weighting and g-estimation were agnostic about this 
relationship. 

When our parametric assumptions for E[F|A, C = 0,p{L)] are correct, 
plus exchangeability and positivity hold, the model estimates the average 
causal effects within all levels s of the propensity score E[y^^''^=°|p(I/) = 
s] — E[y~°''^^'^|p(L) = s]. If we were interested in the average causal effect in 
the entire study population E[y°=^''= = 0] — E[y'*=°'° = 0], we would standard- 
ize the conditional means E[y|A, C = 0,p{L)] by using the distribution of the 
propensity score. The procedure is the same one described in Chapter 13 for 
continuous variables, except that we replace the variables L by the estimated 
p{L). In our example, the standardized effect estimate was 3.6 (95% CI: 2.6, 
4.5) kg. 



15.4 Propensity matching 

Matching on the propensity score p{L) is analogous to matching on a variable 
L, a procedure described in Chapter 4. There are many forms of propensity 
Propensity matching is conducted matching. All of them attempt to form a matched population in which the 
in such a way that the matched treated and the untreated are exchangeable because they have the same distri- 
population ends up having the p{L) bution of p{L). For example, one can match the untreated to the treated: each 
distribution of the untreated, the treated individual is paired with one (or more) imtreated individuals with the 
entire population, or any other ar- same propensity score value. The subset of the original population comprised 
bitrary distribution. by the treated-untreated pairs (or sets) is the matched population. Under ex- 

changeability and positivity given p{L), association measures in the matched 
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A drawback of matching used to be 
that nobody knew how to compute 
the variance of the effect estimate. 
That is no longer the case thanks 
to the work of Abadie and Imbens 
(2006). 



Remember: positivity is now de- 
fined within levels of the propensity 
score, i.e., Pr [A = a\P{L) = p] > 
0 for all p such that Pr [P{L) = p] 



IS nonzero. 



population are consistent estimates of effect measures, e.g., the associational 
risk ratio in the matched population consistently estimates the causal risk ratio 
in the matched population. 

Again, it is unlikely that two individuals will have exactly the same values 
of the propensity score p{L). In our example, propensity score matching will be 
carried out by identifying, for each treated individual, one (or more) untreated 
individuals with a close value oip{L). A common approach is to match treated 
individuals with a value .s of the estimated p{L) with untreated individuals who 
have a value s ± 0.05, or some other small difference. For example, treated 
subject 1089 (estimated p{L) of 0.6563) might be matched with untreated 
subject 1088 (estimated p{L) of 0.6579). There are numerous ways of defining 
closeness, and a detailed description of these definitions is beyond the scope of 
this book. 

Defining closeness in propensity matching entails a bias-variance trade- 
off. If the closeness criteria are too loose, individuals with relatively different 
values of p{L) will be matched to each other, the distribution of p{L) will 
differ between the treated and the uritreatc^d in the matched population, and 
exchangeability will not hold. On the other hand, if the closeness criteria are 
too tight and many individuals are excluded by the matching procedure, there 
will be approximate exchangeability but the effect estimate may have wider 
95% confidence intervals. 

The definition of closeness is also related to that of positivity. In our smok- 
ing cessation example, the distributions of the estimated p{L) in the treated 
and the untreated overlapped throughout most of the range (see Figure 15.1). 
Only 2 treated individuals (0.01% of the study population) had values greater 
than those of any untreated individual. When using outcome regression on the 
estimated p{L) in the previous section, we effectively assumed that the lack 
of untreated individuals with high p{L) estimates was due to chance — random 
nonpositivity — and thus included all subjects in the analysis. In contrast, 
most propensity matched analyses would not consider those 2 treated individ- 
uals close enough to any of the untreated individuals, and would exclude them. 
Matching does not distinguish between random and structural nonpositivity. 

The above discussion illustrates how the matched population may be very 
different from the target (super)population. In theory, propensity matching 
can be used to estimate the causal effect in a well characterized target pop- 
ulation. For example, when matching each treated individual with one or 
more untreated individuals and excluding the unmatched untreated, one is es- 
timating the effect in the treated (see Fine Point 15.2). In practice, however, 
propensity matching may yield an effect estimate in a hard-to-describe subset 
of the study population. For example, under a given definition of closeness, 
some treated individuals cannot be matched with any untreated individuals 
and thus they are excluded from the analysis. As a result, the effect estimate 
corresponds to a subset of the population that is defined by the values of the 
estimated propensity score that have successful matches. 

That propensity matching forces investigators to restrict the analysis to 
treatment groups with overlapping distributions of the estimated propensity 
score is often presented as a strength of the method. One surely would not want 
to have biased estimates because of violations of positivity, right? However, 
leaving aside issues related to random variability (see above), there is a price 
to be paid for restrictions based on the propensity score. Suppose that, after 
inspecting Figure 15.1, we conclude that we can only estimate the effect of 
smoking cessation for individuals with an estimated propensity score less than 
0.67. Who are these people? It is unclear because individuals do not come with 
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Fine Point 15.2 

Effect modification and the propensity score. A reason why matched and unmatched estimates may differ is effect 
modification. As an example, consider the common setting in which the number of untreated individuals is much 
larger than the number of treated individuals. Propensity matching often results in almost all treated individuals being 
matched and many untreated individuals being unmatched and therefore excluded from the analysis. When this occurs, 
the distribution of causal effect modifiers in the matched population will resemble that in the treated. Therefore, the 
effect in the matched population will be closer to the effect in the treated than to the effect that would have been 
estimated by methods that use data from the entire population. 

An advantage in using the propensity score is that effect modification may be more obvious across strata of the 
estimated propensity score p{L) than across strata of the individual variables L. Also, as discussed by Kurth et al 
(2006), effect modification across propensity strata may be interpreted as evidence that doctors know what they are 
doing, i.e. that they tend to treat patients who are more likely to benefit from treatment. However, the presence of 
effect modification by p{L) may complicate the interpretation of the estimates. Consider a situation with qualitative 
effect modification: "Doctor, according to our study, this drug is beneficial for patients who have a propensity score 
between 0.11 and 0.93 when they arrive at your office, but it may kill those with propensity scores below 0.11," or 
"Ms. Minister, let's apply this educational intervention to children with propensity scores below 0.57 only." The above 
statements are of little policy relevance because, as discussed in the main text, they are not expressed in terms of the 
measured variables L. 

Finally, besides effect modification, there are other reasons why matched estimates may differ from the overall effect 
estimate: violations of positivity in the non-matched, an unmeasured confounder that is more/less prevalent (or that is 
better/worse measured) in the matched population than in the unmatched population, etc. As discussed for individuals 
variables L in Chapter 4, remember that effect modification might be explained by differential mismeasurement of 
confounders across propensity strata. 



Even if every subject came with 
her propensity score tattooed on 
her forehead, the population could 
still be ill-characterized because the 
same propensity score value may 
mean different things in different 
settings. 



a propensity score tattooed on their forehead. Because the matched population 
is not well characterized, it is hard to assess the transportability of the effect 
estimate to other populations. 

When positivity concerns arise, restriction based on real- world variables 
(e.g., age, number of cigarettes) leads to a more natural characterization of the 
causal effect. In our smoking cessation example, the two treated individuals 
with estimated p{L) > 0.67 were the only ones in the study who were over 
age 50 and had smoked for less than 10 years. We could exclude them and 
explain that our effect estimate only applies to smokers under age 50 and to 
smokers 50 and over who had smoked for at least 10 years. This way of defining 
the target population is more natural than defining it as those with estimated 
p{L) < 0.67. 

Using propensity scores to detect the overlapping range of the treated and 
the untreated may be useful, but simply restricting the study population to 
that range is a lazy way to ensure positivity. The automatic positivity ensured 
by propensity matching needs to be weighed against the difSculty of assessing 
transportability when restriction is solely based on the value of the estimated 
propensity scores. 



15.5 Propensity models, structural models, predictive models 



In Part II of this book we have described two different types of models for causal 
inference: propensity models and structural models. Let us now compare them. 
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For non-time-varying dichotomous 
treatnnents, IP weighting and g- 
estimation are based on models for 
treatment whicli are precisely the 
same propensity models discussed 
in this chapter. 



Refer back to Fine Point 14.1 for a 
discussion of the relation between 
structural nested models and faux 
semiparametric marginal structural 
models, and other subtleties. 



In a purely predictive study, Face- 
book Likes were found to predict 
sexual orientation, ethnicity, reli- 
gion, political views, and person- 
ality traits (Kosinski et al, 2013). 
Low intelligence was predicted by, 
among other things, a "Harley 
Davidson" Like. 

Remember; This is all about pre- 
diction. The authors do not suggest 
that these associations are causal, 
and neither do we. 



Propensity models are models for the probability of treatment A given 
the variables L used to try to achieve conditional exchangeability. We have 
used propensity models for matching and stratification in this chapter, for IP 
weighting in Chapter 12, and for g-estimation in Chapter 14. The parameters 
of propensity models are nuisance parameters (see Fine Point 15.1) without a 
causal interpretation because a variable L and treatment A may be associated 
for many reasons — not only because the variable L causes A. For example, 
the association between L and A can be interpreted as the effect of L on 
under Figure 7.1, but not under Figure 7.2. Yet propensity models are useful 
for causal inference, often as the basis of the estimation of the parameters of 
structural models, as we have described in this and previous chapters. 

Structural models describe the relation between the treatment A and some 
component of the distribution (e.g., the mean) of the counterfactual outcome 
either marginally or within levels of the variables L. For continuous treat- 
ments, a structural model is often referred to as a dose-response model. The 
parameters for treatment in structural models are not nuisance parameters: 
they have a direct causal interpretation as outcome differences under differ- 
ent treatment values a. We have described two classes of structural models: 
marginal structural models and structural nested models. Marginal structural 
models include parameters for treatment, for the variables V that may be ef- 
fect modifiers, and for product terms between treatment and variables V. The 
choice of V reflects only the investigator's substantive interest in effect mod- 
ification (see Section 12.5). If no covariatcs V are included, then the model 
is truly marginal. If all variables L are included as possible effect modifiers, 
then the marginal structural model becomes a faux marginal structural model. 
Structural nested models include parameters for treatment and for product 
terms between treatment A and all variables in L that are effect modifiers. 

We have presented outcome regression as a method to estimate the para- 
meters of faiix marginal structural models for caiisal inference. However, out- 
come regression is also widely used for purely predictive, as opposed to causal, 
purposes. For example, online retailers use sophisticated outcome regression 
models to predict which customers are more likely to purchase their products. 
The goal is not to determine whether your age, sex, income, geographic origin, 
and previous purchases have a causal effect on your current purchase. Rather, 
the goal is to identify those customers who arc more likely to make a purchase 
so that specific marketing programs can be targeted to them. It is all about 
association, not causation. Similarly, doctors use algorithms based on outcome 
regression to identify patients at high risk of developing a serious disease or 
dying. The parameters of these predictive models do not necessarily have any 
causal interpretation and all covariates in the model have the same status, i.e., 
there are no treatment variable A and variables L. 

The dual use of outcome regression in both causal inference method and 
in prediction has led to many misunderstandings. One of the most impor- 
tant misunderstandings has to do with variable selection procedures. When 
the interest lies exclusively on outcome prediction, investigators want to select 
any variables that, when included as covariates in the model, improve its pre- 
dictive ability. Many well-known variable selection procedures — e.g., forward 
selection, backward elimination, stepwise selection — and more recent develop- 
ments in machine learning are used to enhance prediction. These are powerful 
tools for investigators who are interested in prediction, especially when dealing 
with very high-dimensional data. 

Unfortunately, statistics courses and textbooks have not always made a 
sharp difference between causal inference and prediction. As a result, these 
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It is not uncommon for propen- 
sity analyses to report measures of 
predictive power like Mallows's Cp. 
The relevance of these measures for 
causal inference is questionable. 



variable selection procedures for predictive models have often been applied to 

causal inference models. A possible result of this mismatch is the inclusion of 
superfluous — or even harmful — covariates in propensity models and structural 
models. Specifically, the application of predictive algorithms to causal inference 
models may result in inflated variances, as discussed in Chapter REF. 

One of the reasons for variance inflation is the widespread, but mistaken, 
belief that propensity models should predict treatment A as well as possible. 
Propensity models do not need to predict treatment very well. They just need 
to include the variables L that guarantee exchangeability. Covariates that 
are strongly associated with treatment, but are not necessary to guarantee 
exchangeability, do not help reduce bias. If these covariates were included in 
L, adjustment can actually result in estimates with very large variances. 

Consider the following example. Suppose all individuals in certain study 
attend either hospital Aceso or hospital Panacea. Doctors in hospital Aceso 
give treatment A = 1 to 99% of the individuals, and those in hospital Panacea 
give A = 0 to 99% of the individuals. Suppose the variable Hospital has 
no effect on the outcome (except through its effect on treatment A) and is 
therefore not necessary to achieve conditional exchangeability. Say we decide 
to add Hospital as a covariate in our propensity model anyway. The propensity 
score p{L) in the target population is at least 0.99 for everybody in the study, 
but by chance we may end up with a study population in which everybody 
in hospital Aceso has A = 1 or everybody in hospital Panacea has ^4 = 0 for 
some strata deflned by L. That is, our effect estimate may have a near-infinite 
variance without any reduction in confounding. That treatment is now very 
well predicted is irrelevant for causal inference purposes. 

Besides variance inflation, a predictive attitude towards variable selection 
for causal inference models — both propensity models and outcome regression 
models — may also result in self-inflicted bias. For example, the inclusion of 
covariates strongly associated with treatment in propensity models may result 
in small-sample bias and the inclusion of colliders as covariates may result 
in systematic bias. Colliders, however, may be cfi'cctive covariates for purely 
predictive purposes. Wc will return to these issues in Chapter REF. 

All causal infcreiico methods based on models propensity models and 
structural models — require no misspecification of the functional form of the 
covariates. To reduce the possibility of model misspecification, we use fiexible 
specifications, e.g., cubic splines rather than linear terms. In addition, these 
causal inference methods require the conditions of exchangeability, positivity, 
and well-defined interventions for unbiased causal inferences. In the next chap- 
ter we describe a very different type of causal inference method that does not 
require exchangeability as we know it. 
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